Abstract

In applications such as recommendation systems and revenue management, it is important to 
predict preferences on items that have not been seen by a user or predict outcomes of comparisons 
among those that have never been compared. A popular discrete choice model, the multinomial
logit (MNL) model, captures the structure of the hidden preferences with a low-rank matrix. In order to
predict the preferences, we want to learn the underlying model from noisy observations of the 
low-rank matrix, collected as revealed preferences in various forms of ordinal data. A natural 
approach to learn such a model is to solve a convex relaxation of nuclear norm minimization. 
We present the convex relaxation approach in two contexts of interest: collaborative ranking 
and bundled choice modeling. In both cases, we show that the convex relaxation is minimax 
optimal. We prove an upper bound on the resulting error with finite samples, and provide a 
matching information-theoretic lower bound. 


1 Introduction 

In many applications such as recommendation systems and revenue management, it is important to 
predict preferences on items that have not been seen by a user or predict outcomes of comparisons 
among those that have never been compared. Predicting such hidden preferences would be hopeless 
without further assumptions on the structure of the preference. Motivated by the success of matrix 
factorization models on collaborative filtering applications, we model hidden preferences with low- 
rank matrices to collaboratively learn preference matrices from ordinal data. In this paper, we 
consider the following two concrete scenarios: 

• Collaborative ranking. Consider an online market that collects each user’s preference as a 
ranking over a subset of items that are ‘seen’ by the user. Such data can be obtained by 
directly asking to compare some items, or by indirectly tracking online activities on which 
items are viewed, how much time is spent on the page, or how the user rated the items. In 
order to make personalized recommendations, we want (a) a model that captures how users 
who preferred similar items are also likely to have similar preferences on unseen items; and 
(b) to predict which items the user might prefer, by learning such models from ordinal data.
• Bundled choice modeling. Discrete choice models describe how a user makes decisions on what
to purchase. Typical choice models assume the willingness to buy an item is independent of 
what else the user bought. In many cases, however, we make ‘bundled’ purchases: we buy 
particular ingredients together for one recipe or we buy two connecting flights. One choice 
(the first flight) has a significant impact on the other (the connecting flight). In order to 
optimize the assortment (which flight schedules to offer) for maximum expected revenue, it 
is crucial to accurately predict the willingness of the consumers to purchase, based on past 
history. We consider a case where there are two types of products (e.g. jeans and shirts), and 
want (a) a model that captures such interacting preferences for pairs of items, one from each 
category; and (b) to predict the consumer’s choice probabilities on pairs of items, by learning
such models from past purchase history. 


We use a discrete choice model known as the MultiNomial Logit (MNL) model (described in Section
2.1) to represent the preferences. In the collaborative ranking context, MNL uses a low-rank matrix
to represent the hidden preferences of the users. Each row corresponds to a user’s preference over
all the items, and when presented with a subset of items the user provides a ranking over those
items, which is a noisy version of the hidden true preference. The low-rank assumption naturally
captures the similarities among users and items, by representing each in a low-dimensional space.
In the bundled choice modeling context, the low-rank matrix now represents how pairs of items are
matched. Each row corresponds to an item from the first category and each column corresponds to 
an item from the second category. An entry in the matrix represents how much the pair is preferred 
by a randomly chosen user from a pool of users. Notice that in this case we do not model individual 
preferences, but the preference of the whole population. The purchase history of the population is 
the record of which pair was chosen among a subset of items that were presented, which is again
a noisy version of the hidden true preference. The low-rank assumption captures the similarities
and dissimilarities among the items in the same category and the interactions across categories.

Contribution. A natural approach to learn such a low-rank model, from noisy observations, 
is to solve a convex relaxation of nuclear norm minimization (described in Section 2.2). We present 
such an approach for learning the MNL model from ordinal data, in two contexts: collaborative 
ranking and bundled choice modeling. In both cases, we analyze the sample complexity of the 
algorithm, and provide an upper bound on the resulting error with finite samples. We prove 
minimax-optimality of our approach by providing a matching information-theoretic lower bound 
(up to a poly-logarithmic factor). Technically, we utilize the Random Utility Model (RUM) interpretation
(outlined in Section 2.1) of the MNL model to prove both the upper bound and the
fundamental limit, which could be of interest for analyzing more general classes of RUMs.

Related work. In the context of collaborative ranking, MNL models have been proposed
to model partial rankings from a pool of users. Existing work is limited to the case when each
user provides pairwise comparisons [?]. [?] proposes solving a convex relaxation of maximizing
the likelihood over matrices with bounded nuclear norm. It is shown that this approach achieves
a statistically optimal generalization error rate. Our analysis techniques are inspired by [1], which
proposed a convex relaxation similar to ours, but for the case when the users provide only pairwise
comparisons. For pairwise comparisons, our main result in Theorem 1 matches that of [?], but our
result is more general in the sense that we analyze more general sampling models beyond pairwise
comparisons. In general, “collaborative ranking” has typically been used to refer to the problem of
learning personal rankings when the data is ratings on items (as opposed to ordinal data). Matrix
factorization approaches have been widely applied in practice [?], but no theoretical guarantees
are known.

The remainder of the paper is organized as follows. In Section 2, we present the MNL model
and propose a convex relaxation for learning the model, in the context of collaborative ranking. We
provide theoretical guarantees for collaborative ranking in Section 3. In Section 4, we present the
problem statement for bundled choice modeling, and analyze a similar convex relaxation approach.

Notations. We use $|||A|||_{\mathrm{F}}$ and $|||A|||_\infty$ to denote the Frobenius norm and the $\ell_\infty$ norm,
$|||A|||_{\mathrm{nuc}} \equiv \sum_i \sigma_i(A)$ to denote the nuclear norm where $\sigma_i(A)$ denotes the $i$-th singular value,
and $|||A|||_2 = \sigma_1(A)$ for the spectral norm. We use $\langle\langle u, v \rangle\rangle = \sum_i u_i v_i$ and $\|u\|$ to denote the
inner product and the Euclidean norm. The all-ones vector is denoted by $\mathbb{1}$ and $\mathbb{I}(A)$ is the
indicator function of the event $A$. The set of the first $N$ integers is denoted by $[N] = \{1, \ldots, N\}$.

2 Model and Algorithm 

In this section, we present a discrete choice model for collaborative ranking, and propose an
inference algorithm for learning the model from ordinal data.

2.1 MultiNomial Logit (MNL) model for comparative judgment 

In collaborative ranking, we want to model how people who have similar preferences on a subset 
of items are likely to have similar tastes on other items as well. When users provide ratings, as in 
collaborative filtering applications, matrix factorization models are widely used since the low-rank 
structure captures the similarities between users. When users provide ordered preferences, we use
a discrete choice model known as the MultiNomial Logit (MNL) model, which has a similar low-rank
structure that captures the similarities between users and items.

Let $\Theta^*$ be the $d_1 \times d_2$ dimensional matrix capturing the preference of $d_1$ users on $d_2$ items, where
the rows and columns correspond to users and items, respectively. Typically, $\Theta^*$ is assumed to be
low-rank, having a rank $r$ that is much smaller than the dimensions. However, in the following we
allow a more general setting where $\Theta^*$ might be only approximately low rank. When a user $i$ is
presented with a set of alternatives $S_i \subseteq [d_2]$, she reveals her preferences as a ranked list over those
items. To simplify the notation we assume all users compare the same number $k$ of items, but
the analysis naturally generalizes to the case when the size differs from user to user. Let
$v_{i,\ell} \in S_i$ denote the (random) $\ell$-th best choice of user $i$. Each user gives a ranking, independent of
other users’ rankings, from

\[
\mathbb{P}\{v_{i,1}, \ldots, v_{i,k}\} \;=\; \prod_{\ell=1}^{k} \frac{e^{\Theta^*_{i,v_{i,\ell}}}}{\sum_{j \in S_{i,\ell}} e^{\Theta^*_{i,j}}} , \tag{1}
\]

where $S_{i,\ell} \equiv S_i \setminus \{v_{i,1}, \ldots, v_{i,\ell-1}\}$. For a single user $i$, the $i$-th row of $\Theta^*$ represents the
underlying preference vector of the user, and the more preferred items are more likely to be ranked
higher. The probabilistic nature of the model captures the noise in the users’ revealed preferences.

The random utility model (RUM), pioneered by [3, ?, ?], describes the choices of users as
manifestations of the underlying utilities. The MNL model is a special case of RUM where each
decision maker and each alternative are represented by $r$-dimensional feature vectors $u_i$ and $v_j$
respectively, such that $\Theta^*_{ij} = \langle\langle u_i, v_j \rangle\rangle$, resulting in a low-rank matrix. When presented with a set of
alternatives $S_i$, the decision maker $i$ ranks the alternatives according to their random utility drawn
from

\[
U_{ij} \;=\; \langle\langle u_i, v_j \rangle\rangle + \xi_{ij} , \tag{2}
\]

for item $j$, where the $\xi_{ij}$ follow the standard Gumbel distribution. Intuitively, this provides a justification
for the MNL model as modeling the decision makers as rational beings, seeking to maximize
utility. Technically, this RUM interpretation plays a crucial role in our analysis, in proving restricted
strong convexity in Appendix A.4 and also in proving the fundamental limit in Appendix C.
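The equivalence between the RUM view (2) and the sequential choice formula (1) can be checked numerically. The following is a minimal sketch, not code from the paper: the function names, the three hypothetical scores, and the assumption that the presented items are distinct are all illustrative choices. It ranks items by Gumbel-perturbed utilities and compares the empirical frequency of one ranking with the value given by (1).

```python
import math
import random


def sample_ranking_rum(theta_row, items, rng):
    """Rank `items` by decreasing random utility, as in the RUM of Eq. (2).

    theta_row[j] plays the role of <<u_i, v_j>>; the items are assumed distinct.
    """
    def gumbel():
        # Standard Gumbel draw via the inverse CDF: -log(-log(U)).
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        return -math.log(-math.log(u))

    utility = {j: theta_row[j] + gumbel() for j in items}
    return sorted(items, key=lambda j: utility[j], reverse=True)


def ranking_probability(theta_row, ranking):
    """Probability of `ranking` (best first) under the sequential formula (1)."""
    prob = 1.0
    remaining = list(ranking)
    for j in ranking:
        denom = sum(math.exp(theta_row[a]) for a in remaining)
        prob *= math.exp(theta_row[j]) / denom
        remaining.remove(j)
    return prob


if __name__ == "__main__":
    rng = random.Random(0)
    theta_row = [1.0, 0.0, -1.0]   # hypothetical preference scores for three items
    target = [0, 1, 2]
    n = 20000
    hits = sum(sample_ranking_rum(theta_row, [0, 1, 2], rng) == target for _ in range(n))
    print("empirical:", hits / n, "formula (1):", ranking_probability(theta_row, target))
```

With enough draws the two printed numbers should agree up to sampling noise, which is the sense in which the Gumbel noise in (2) induces the MNL ranking distribution.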

There are a few cases where the Maximum Likelihood (ML) estimation for RUM is tractable.
One notable example is the Plackett-Luce (PL) model, which is a special case of the MNL model
where $\Theta^*$ is rank-one and all users have the same features. The PL model has been widely applied in
econometrics [8], analyzing elections [9], and machine learning [10]. Efficient inference algorithms
have been proposed [?], and the sample complexity has been analyzed for the MLE [?] and
for the Rank Centrality [?]. Although PL is quite restrictive, in the sense that it assumes all users
share the same features, little is known about inference in RUMs beyond PL. Recently, to overcome
such a restriction, mixed PL models have been studied, where $\Theta^*$ is rank-$r$ but there are only $r$
classes of users and all users in the same class have the same features. Efficient inference algorithms
with provable guarantees have been proposed by applying recent advances in tensor decomposition
methods [?], directly clustering the users [?], or using sampling methods [20]. However,
this mixture PL is still restrictive, and both clustering and tensor based approaches rely heavily on
the fact that the distribution is a “mixture” and require additional incoherence assumptions on $\Theta^*$.
For more general models, efficient inference algorithms have been proposed [?] but no performance
guarantee is known for finite samples. Although the MLE for the general MNL model in (1) is
intractable, we provide a polynomial-time inference algorithm with provable guarantees.


2.2 Nuclear norm minimization 


Assuming $\Theta^*$ is well approximated by a low-rank matrix, we estimate $\Theta^*$ by solving the following
convex relaxation given the observed preference in the form of ranked lists $\{(v_{i,1}, \ldots, v_{i,k})\}_{i \in [d_1]}$:

\[
\widehat{\Theta} \;\in\; \arg\min_{\Theta \in \Omega} \; \mathcal{L}(\Theta) + \lambda\, |||\Theta|||_{\mathrm{nuc}} , \tag{3}
\]

where the (negative) log likelihood function according to (1) is

\[
\mathcal{L}(\Theta) \;=\; -\frac{1}{k\, d_1} \sum_{i=1}^{d_1} \sum_{\ell=1}^{k} \Big( \langle\langle \Theta, e_i e_{v_{i,\ell}}^T \rangle\rangle - \log \Big( \sum_{j \in S_{i,\ell}} \exp\big( \langle\langle \Theta, e_i e_j^T \rangle\rangle \big) \Big) \Big) , \tag{4}
\]

with $S_i = \{v_{i,1}, \ldots, v_{i,k}\}$ and $S_{i,\ell} \equiv S_i \setminus \{v_{i,1}, \ldots, v_{i,\ell-1}\}$, and an appropriately chosen set $\Omega$ defined in
(7). Since the nuclear norm is a tight convex surrogate for the rank, the above optimization searches
for a low-rank solution that maximizes the likelihood. Nuclear norm minimization has been widely
used in rank minimization problems [22], but provable guarantees typically exist only for quadratic
loss functions $\mathcal{L}(\Theta)$ [?]. Our analysis extends such analysis techniques to identify the conditions
under which restricted strong convexity is satisfied for a convex loss function that is not quadratic.
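To make the objective concrete, the sketch below evaluates the negative log likelihood of a set of ranked lists and adds the nuclear norm penalty. It is an illustration under stated assumptions rather than the authors' code: NumPy is assumed to be available, the function names are hypothetical, items are indexed from 0, and the constraint set is not enforced.

```python
import numpy as np


def neg_log_likelihood(theta, rankings):
    """Negative log likelihood of ranked lists under the MNL model.

    theta: (d1, d2) array; rankings[i]: the k items ranked by user i, best first.
    """
    d1, k = len(rankings), len(rankings[0])
    total = 0.0
    for i, ranking in enumerate(rankings):
        scores = theta[i, list(ranking)]
        for pos in range(k):
            remaining = scores[pos:]              # scores of the items not yet chosen
            shift = remaining.max()               # log-sum-exp stabilisation
            log_norm = shift + np.log(np.exp(remaining - shift).sum())
            total += scores[pos] - log_norm
    return -total / (k * d1)


def nuclear_norm(theta):
    """Sum of the singular values of theta."""
    return np.linalg.svd(theta, compute_uv=False).sum()


def objective(theta, rankings, lam):
    """Penalised objective: negative log likelihood plus lam times the nuclear norm."""
    return neg_log_likelihood(theta, rankings) + lam * nuclear_norm(theta)
```

For the all-zero matrix every ranking of $k$ items is equally likely, so the negative log likelihood equals $\log(k!)/k$, which gives a quick sanity check of the normalisation by $k\,d_1$.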
3 Collaborative ranking from k-wise comparisons

We first provide background on the MNL model, and then present main results on the performance
guarantees. Notice that the distribution (1) is independent of shifting each row of $\Theta^*$ by a constant.
Hence, there is an equivalence class of $\Theta^*$ that gives the same distributions for the ranked lists:

\[
[\Theta^*] \;=\; \big\{ A \in \mathbb{R}^{d_1 \times d_2} \;\big|\; A = \Theta^* + u \mathbb{1}^T \text{ for some } u \in \mathbb{R}^{d_1} \big\} . \tag{5}
\]

Since we can only estimate $\Theta^*$ up to this equivalence class, we search for the one whose rows sum to
zero, i.e. $\sum_{j \in [d_2]} \Theta^*_{i,j} = 0$ for all $i \in [d_1]$. Let $\alpha \equiv \max_{i, j_1, j_2} |\Theta^*_{i j_1} - \Theta^*_{i j_2}|$ denote the dynamic range
of the underlying $\Theta^*$, such that when $k$ items are compared, we always have

\[
\frac{1}{k} e^{-\alpha} \;\le\; \mathbb{P}\{v_{i,1} = j\} \;\le\; \frac{1}{k} e^{\alpha} , \tag{6}
\]

for all $j \in S_i$, all $S_i \subseteq [d_2]$ satisfying $|S_i| = k$ and all $i \in [d_1]$. We do not make any assumptions on
$\alpha$ other than that $\alpha = O(1)$ with respect to $d_1$ and $d_2$. The purpose of defining the dynamic range
in this way is that we seek to characterize how the error scales with $\alpha$. Given this definition, we
solve the optimization in (3) over

\[
\Omega_\alpha \;=\; \Big\{ A \in \mathbb{R}^{d_1 \times d_2} \;\Big|\; |||A|||_\infty \le \alpha , \text{ and } \forall i \in [d_1] \text{ we have } \sum_{j \in [d_2]} A_{ij} = 0 \Big\} . \tag{7}
\]

While in practice we do not require the $\ell_\infty$ norm constraint, we need it for the analysis. For a
related problem of matrix completion, where the loss $\mathcal{L}(\Theta)$ is quadratic, either a similar condition
on the $\ell_\infty$ norm is required or a different condition on incoherence is required.


3.1 Performance guarantee 


We provide an upper bound on the resulting error of our convex relaxation, when a multiset of
items $S_i$ presented to user $i$ is drawn uniformly at random with replacement. Precisely, for a given
$k$, $S_i = \{j_{i,1}, \ldots, j_{i,k}\}$ where the $j_{i,\ell}$'s are independently drawn uniformly at random over the $d_2$ items.
Further, if an item is sampled more than once, i.e. if there exists $j_{i,\ell_1} = j_{i,\ell_2}$ for some $i$ and $\ell_1 \neq \ell_2$,
then we assume that the user treats these two items as if they are two distinct items with the
same MNL weights $\Theta^*_{i,j_{i,\ell_1}} = \Theta^*_{i,j_{i,\ell_2}}$. The resulting preference is therefore always over $k$ items (with
possibly multiple copies of the same item), and distributed according to (1). For example, if $k = 3$,
it is possible to have $S_i = \{j_{i,1} = 1, j_{i,2} = 1, j_{i,3} = 2\}$, in which case the resulting ranking can be
$(v_{i,1} = j_{i,1}, v_{i,2} = j_{i,3}, v_{i,3} = j_{i,2})$ with probability $\big(e^{\Theta^*_{i,1}} / (2 e^{\Theta^*_{i,1}} + e^{\Theta^*_{i,2}})\big) \times \big(e^{\Theta^*_{i,2}} / (e^{\Theta^*_{i,1}} + e^{\Theta^*_{i,2}})\big)$.
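The $k = 3$ example with a repeated item can be verified mechanically. The sketch below is illustrative only: the function name and the two numerical weights are hypothetical. It treats each presented slot as a separate alternative, applies the sequential MNL choice rule, and compares the result with the closed-form product stated in the example.

```python
import math


def slot_ranking_probability(theta_row, slots, order):
    """Probability that the presented slots are ranked in `order`.

    slots[s] is the item shown in slot s (repeats allowed); `order` lists slot
    indices from most to least preferred. Copies of an item share one weight.
    """
    remaining = list(range(len(slots)))
    prob = 1.0
    for s in order:
        denom = sum(math.exp(theta_row[slots[a]]) for a in remaining)
        prob *= math.exp(theta_row[slots[s]]) / denom
        remaining.remove(s)
    return prob


theta_row = {1: 0.3, 2: -0.4}   # hypothetical MNL weights of items 1 and 2
slots = [1, 1, 2]               # j_{i,1} = 1, j_{i,2} = 1, j_{i,3} = 2
order = [0, 2, 1]               # v_{i,1} = j_{i,1}, v_{i,2} = j_{i,3}, v_{i,3} = j_{i,2}

w1, w2 = math.exp(theta_row[1]), math.exp(theta_row[2])
closed_form = w1 / (2 * w1 + w2) * w2 / (w1 + w2)
print(slot_ranking_probability(theta_row, slots, order), closed_form)
```

The two printed values coincide, and summing the function over all six slot orderings gives one, so the multiset convention still defines a proper distribution over rankings.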

Such sampling with replacement is necessary for the analysis, where we require independence in
the choice of the items in $S_i$ in order to apply the symmetrization technique (e.g. [?]) to bound
the expectation of the deviation (cf. Appendix A.4). Similar sampling assumptions have been
made in existing analyses on learning low-rank models from noisy observations, e.g. [24]. Let
$d \equiv (d_1 + d_2)/2$, and let $\sigma_j(\Theta^*)$ denote the $j$-th singular value of the matrix $\Theta^*$. Define

\[
\lambda_0 \;\equiv\; e^{2\alpha} \sqrt{ \frac{ d_1 \log d + d_2 (\log d)^2 }{ k\, d_1^2\, d_2 } } .
\]
Theorem 1. Under the described sampling model, assume $24 \le k \le \min\{d_1^2, (d_1^2 + d_2^2)/(2 d_1)\} \log d$,
and $\lambda \in [32\lambda_0, c_0 \lambda_0]$ with any constant $c_0 = O(1)$ larger than 32. Then, solving the optimization
(3) achieves

\[
\frac{1}{d_1 d_2} |||\widehat{\Theta} - \Theta^*|||_{\mathrm{F}}^2 \;\le\; 288\sqrt{2}\, e^{4\alpha} c_0 \lambda_0 \sqrt{r}\, |||\widehat{\Theta} - \Theta^*|||_{\mathrm{F}} \;+\; 288\, e^{4\alpha} c_0 \lambda_0 \sum_{j=r+1}^{\min\{d_1, d_2\}} \sigma_j(\Theta^*) , \tag{8}
\]

for any $r \in \{1, \ldots, \min\{d_1, d_2\}\}$ with probability at least $1 - 2d^{-3}$ where $d = (d_1 + d_2)/2$.

A proof is provided in Appendix A. The above bound shows a natural splitting of the error into
two terms, one corresponding to the estimation error for the rank-r component and the second 
one corresponding to the approximation error for how well one can approximate 0* with a rank-r 
matrix. This bound holds for all values of r and one could potentially optimize over r. We show 
such results in the following corollaries. 


Corollary 3.1 (Exact low-rank matrices). Suppose $\Theta^*$ has rank at most $r$. Under the hypotheses
of Theorem 1, solving the optimization (3) with the choice of the regularization parameter
$\lambda \in [32\lambda_0, c_0\lambda_0]$ achieves with probability at least $1 - 2d^{-3}$,

\[
\frac{1}{\sqrt{d_1 d_2}} |||\widehat{\Theta} - \Theta^*|||_{\mathrm{F}} \;\le\; 288\sqrt{2}\, e^{6\alpha} c_0 \sqrt{ \frac{ r \,( d_1 \log d + d_2 (\log d)^2 ) }{ k\, d_1 } } . \tag{9}
\]

The number of entries is $d_1 d_2$ and we rescale the Frobenius norm error appropriately by $1/\sqrt{d_1 d_2}$.
When $\Theta^*$ is a rank-$r$ matrix, then the degrees of freedom in representing $\Theta^*$ is $r(d_1 + d_2) - r^2 =
O(r(d_1 + d_2))$. The above theorem shows that the total number of samples, which is $k d_1$, needs
to scale as $O(r d_1 (\log d) + r d_2 (\log d)^2)$ in order to achieve an arbitrarily small error. This is only
a poly-logarithmic factor larger than the degrees of freedom. In Section 3.2, we provide a lower bound
on the error directly, that matches the upper bound up to a logarithmic factor.
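The comparison between the number of samples and the degrees of freedom is simple arithmetic, sketched below. The function names are hypothetical, the logarithm is taken to be natural, and the leading constant and the dependence on the dynamic range are deliberately dropped, so the second function returns only the rate inside the square root of the rank-$r$ bound.

```python
import math


def degrees_of_freedom(r, d1, d2):
    """Number of free parameters of a rank-r matrix of size d1 x d2."""
    return r * (d1 + d2) - r * r


def rank_r_rate(r, d1, d2, k):
    """sqrt(r (d1 log d + d2 (log d)^2) / (k d1)) with d = (d1 + d2) / 2."""
    log_d = math.log((d1 + d2) / 2)
    return math.sqrt(r * (d1 * log_d + d2 * log_d ** 2) / (k * d1))


if __name__ == "__main__":
    r, d1, d2 = 3, 50, 50
    for k in (25, 100, 400):
        print(k, k * d1, degrees_of_freedom(r, d1, d2), round(rank_r_rate(r, d1, d2, k), 3))
```

Quadrupling $k$ halves the rate, which is the $1/\sqrt{k}$ scaling that the bound predicts for fixed dimensions and rank.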

The dependence on the dynamic range $\alpha$, however, is sub-optimal. It is expected that the error
increases with $\alpha$, since $\Theta^*$ scales as $\alpha$, but the exponential dependence in the bound seems
to be a weakness of the analysis, as seen from numerical experiments in the right panel of Figure
1. Although the error increases with $\alpha$, numerical experiments suggest that it only increases at
most linearly. However, tightening the scaling with respect to $\alpha$ is a challenging problem, and such
sub-optimal dependence is also present in existing literature for learning even simpler models, such
as the Bradley-Terry model [?] or the Plackett-Luce model [?], which are special cases of the
MNL model studied in this paper. A practical issue in achieving the above rate is the choice of
$\lambda$, since the dynamic range $\alpha$ is not known in advance. Figure 1 illustrates that the error is not
sensitive to the choice of $\lambda$ for a wide range.

Another issue is that the underlying matrix might not be exactly low rank. It is more realistic
to assume that it is approximately low rank. Following [23], we formalize this notion with the “$\ell_q$-ball”
of matrices defined as

\[
\mathbb{B}_q(\rho_q) \;\equiv\; \Big\{ \Theta \in \mathbb{R}^{d_1 \times d_2} \;\Big|\; \sum_{j \in [\min\{d_1, d_2\}]} |\sigma_j(\Theta)|^q \le \rho_q \Big\} . \tag{10}
\]

When $q = 0$, this is a set of rank-$\rho_0$ matrices. For $q \in (0, 1]$, this is a set of matrices whose singular
values decay relatively fast. Optimizing the choice of $r$ in Theorem 1, we get the following result.
Figure 1: The (rescaled) RMSE scales as $\sqrt{r (\log d)/k}$ as expected from Corollary 3.1 for fixed
$d = 50$ (left). In the inset, the same data is plotted versus rescaled sample size $k/(r \log d)$. The
(rescaled) RMSE is stable for a broad range of $\lambda$ and $\alpha$ for fixed $d = 50$ and $r = 3$ (right).


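Corollary 3.2 (Approximately low-rank matrices). Suppose $\Theta^* \in \mathbb{B}_q(\rho_q)$ for some $q \in (0, 1]$
and $\rho_q > 0$. Under the hypotheses of Theorem 1, solving the optimization (3) with the choice of the
regularization parameter $\lambda \in [32\lambda_0, c_0\lambda_0]$ achieves with probability at least $1 - 2d^{-3}$,

\[
\frac{1}{\sqrt{d_1 d_2}} |||\widehat{\Theta} - \Theta^*|||_{\mathrm{F}} \;\le\; \frac{2 \sqrt{\rho_q}}{\sqrt{d_1 d_2}} \left( 288 \sqrt{2}\, c_0 e^{6\alpha} \sqrt{ \frac{ d_1 d_2 \,( d_1 \log d + d_2 (\log d)^2 ) }{ k\, d_1 } } \right)^{\frac{2-q}{2}} . \tag{11}
\]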


This is a strict generalization of Corollary 3.1. For $q = 0$ and $\rho_0 = r$, this recovers the exact
low-rank estimation bound up to a factor of two. For approximate low-rank matrices in an $\ell_q$-ball,
we lose in the error exponent, which reduces from one to $(2 - q)/2$. A proof of this corollary is
provided in Appendix B.
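The quantity that defines membership in the $\ell_q$-ball of (10) is easy to evaluate from the singular values. The sketch below assumes NumPy; the function name and the relative tolerance used to count nonzero singular values at $q = 0$ are illustrative choices, not part of the paper.

```python
import numpy as np


def lq_ball_value(theta, q, rel_tol=1e-10):
    """Return sum_j sigma_j(theta)^q; for q = 0 this is the numerical rank."""
    s = np.linalg.svd(theta, compute_uv=False)
    if q == 0:
        # Count singular values that are nonzero up to a relative tolerance.
        return int(np.sum(s > rel_tol * max(s.max(), 1.0)))
    return float(np.sum(s ** q))


if __name__ == "__main__":
    theta = np.outer([1.0, 2.0], [3.0, 4.0]) + 0.01 * np.eye(2)
    for q in (0, 0.5, 1.0):
        print(q, lq_ball_value(theta, q))
```

A matrix belongs to the ball of radius $\rho_q$ exactly when the returned value is at most $\rho_q$; at $q = 1$ the value is the nuclear norm.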


The left panel of Figure 1 confirms the scaling of the error rate as predicted by Corollary 3.1.
The lines merge to a single line when the sample size is rescaled appropriately. We make a choice of
$\lambda = (1/2)\sqrt{(\log d)/(k d^2)}$. This choice is independent of $\alpha$ and is smaller than proposed in Theorem
1. We generate random rank-$r$ matrices of dimension $d \times d$, where $\Theta^* = U V^T$ with $U \in \mathbb{R}^{d \times r}$
and $V \in \mathbb{R}^{d \times r}$ entries generated i.i.d. from the uniform distribution over $[0, 1]$. Then the row-mean is
subtracted from each row, and then the whole matrix is scaled such that the largest entry is $\alpha = 5$.
The root mean squared error (RMSE) is plotted where RMSE $= (1/d)|||\Theta^* - \widehat{\Theta}|||_{\mathrm{F}}$. We implement
and solve the convex optimization (3) using the proximal gradient descent method as analyzed in [26].
The right panel in Figure 1 illustrates that the actual error is insensitive to the choice of $\lambda$ for a
broad range of A G [y^(log d)/{kd?), 2^(log d)/{kd?y\, after which it increases with A. 
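The synthetic setup described above can be reproduced along the following lines. This is a sketch under assumptions, not the authors' experimental code: NumPy is assumed, the function names are hypothetical, and "largest entry" is interpreted as the largest entry in absolute value.

```python
import numpy as np


def make_theta(d, r, alpha, rng):
    """Random d x d matrix of rank at most r, rows centred, max |entry| = alpha."""
    u = rng.uniform(0.0, 1.0, size=(d, r))
    v = rng.uniform(0.0, 1.0, size=(d, r))
    theta = u @ v.T
    theta -= theta.mean(axis=1, keepdims=True)   # subtract the row mean from each row
    theta *= alpha / np.abs(theta).max()         # rescale (assumption: absolute value)
    return theta


def rmse(theta_star, theta_hat):
    """(1/d) times the Frobenius norm of the difference, for d x d matrices."""
    d = theta_star.shape[0]
    return np.linalg.norm(theta_star - theta_hat, "fro") / d


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta_star = make_theta(50, 3, 5.0, rng)
    print(theta_star.shape, np.abs(theta_star).max(), rmse(theta_star, np.zeros((50, 50))))
```

Centering the rows does not increase the rank, so the generated matrix still has rank at most $r$ and its rows sum to zero.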


3.2 Information-theoretic lower bound for low-rank matrices 

For a polynomial-time algorithm of convex relaxation, we gave in the previous section a bound on
the achievable error. We next compare this to the fundamental limit of this problem, by giving
a lower bound on the achievable error by any algorithm (efficient or not). A simple parameter
counting argument indicates that it requires the number of samples to scale as the degrees of
freedom, i.e., $k d_1 \propto r(d_1 + d_2)$, to estimate a $d_1 \times d_2$ dimensional matrix of rank $r$. We construct
an appropriate packing over the set of low-rank matrices with bounded entries in $\Omega_\alpha$ defined as
(7), and show that no algorithm can accurately estimate the true matrix with high probability
using the generalized Fano’s inequality. This provides a constructive argument to lower bound
the minimax error rate, which in turn establishes that the bound in Theorem 1 is sharp up to a
logarithmic factor, and proves no other algorithm can significantly improve over the nuclear norm
minimization.


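Theorem 2. Suppose $\Theta^*$ has rank $r$. Under the described sampling model, for large enough $d_1$ and
$d_2 \ge d_1$, there is a universal numerical constant $c > 0$ such that

\[
\inf_{\widehat{\Theta}} \sup_{\Theta^* \in \Omega_\alpha} \mathbb{E}\Big[ \frac{1}{\sqrt{d_1 d_2}} |||\widehat{\Theta} - \Theta^*|||_{\mathrm{F}} \Big] \;\ge\; c \min\Big\{ \alpha e^{-\alpha} \sqrt{\frac{r\, d_2}{k\, d_1}} ,\; \frac{\alpha\, d_2}{\sqrt{d_1 d_2 \log d}} \Big\} , \tag{12}
\]

where the infimum is taken over all measurable functions over the observed ranked lists $\{(v_{i,1}, \ldots, v_{i,k})\}_{i \in [d_1]}$.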


A proof of this theorem is provided in Appendix C. The term of primary interest in this bound
is the first one, which shows the scaling of the (rescaled) minimax rate as $\sqrt{r(d_1 + d_2)/(k d_1)}$
(when $d_2 \ge d_1$), and matches the upper bound in (9). It is the dominant term in the bound
whenever the number of samples is larger than the degrees of freedom by a logarithmic factor, i.e.,
$k d_1 > r(d_1 + d_2) \log d$, ignoring the dependence on $\alpha$. This is a typical regime of interest, where
the sample size is comparable to the latent dimension of the problem. In this regime, Theorem 2
establishes that the upper bound in Theorem 1 is minimax-optimal up to a logarithmic factor in
the dimension $d$.


4 Choice modeling for bundled purchase history 

In this section, we use the MNL model to study another scenario of practical interest: choice 
modeling from bundled purchase history. In this setting, we assume that we have bundled purchase 
history data from $n$ users. Precisely, there are two categories of interest with $d_1$ and $d_2$ alternatives
in each category respectively. For example, there are $d_1$ toothpastes to choose from and $d_2$
toothbrushes to choose from. For the $i$-th user, a subset $S_i \subseteq [d_1]$ of alternatives from the first category
is presented along with a subset $T_i \subseteq [d_2]$ of alternatives from the second category. We use $k_1$ and
$k_2$ to denote the number of alternatives presented to a single user, i.e. $k_1 = |S_i|$ and $k_2 = |T_i|$, and
we assume that the number of alternatives presented to each user is fixed, to simplify notations.
Given these sets of alternatives, each user makes a ‘bundled’ purchase and we use $(u_i, v_i)$ to denote
the bundled pair of alternatives (e.g. a toothbrush and a toothpaste) purchased by the $i$-th user.
Each user makes a choice of the best alternative, independent of other users’ choices, according to
the MNL model as

\[
\mathbb{P}\{(u_i, v_i) = (j_1, j_2)\} \;=\; \frac{ e^{\Theta^*_{j_1, j_2}} }{ \sum_{j_1' \in S_i,\, j_2' \in T_i} e^{\Theta^*_{j_1', j_2'}} } , \tag{13}
\]
for all $j_1 \in S_i$ and $j_2 \in T_i$. The distribution (13) is independent of shifting all the values of $\Theta^*$
by a constant. Hence, there is an equivalence class of $\Theta^*$ that gives the same distribution for the
choices: $[\Theta^*] \equiv \{ A \in \mathbb{R}^{d_1 \times d_2} \mid A = \Theta^* + c\, \mathbb{1}\mathbb{1}^T \text{ for some } c \in \mathbb{R} \}$. Since we can only estimate $\Theta^*$
up to this equivalence class, we search for the one that sums to zero, i.e. $\sum_{j_1 \in [d_1], j_2 \in [d_2]} \Theta^*_{j_1, j_2} = 0$.
Let $\alpha = \max_{j_1, j_1' \in [d_1],\, j_2, j_2' \in [d_2]} |\Theta^*_{j_1, j_2} - \Theta^*_{j_1', j_2'}|$ denote the dynamic range of the underlying $\Theta^*$, such
that when $k_1 \times k_2$ alternatives are presented, we always have

\[
\frac{1}{k_1 k_2} e^{-\alpha} \;\le\; \mathbb{P}\{(u_i, v_i) = (j_1, j_2)\} \;\le\; \frac{1}{k_1 k_2} e^{\alpha} , \tag{14}
\]

for all $(j_1, j_2) \in S_i \times T_i$ and for all $S_i \subseteq [d_1]$ and $T_i \subseteq [d_2]$ such that $|S_i| = k_1$ and $|T_i| = k_2$. We
do not make any assumptions on $\alpha$ other than that $\alpha = O(1)$ with respect to $d_1$ and $d_2$. Assuming
$\Theta^*$ is well approximated by a low-rank matrix, we solve the following convex relaxation, given the
observed bundled purchase history $\{(u_i, v_i, S_i, T_i)\}_{i \in [n]}$:


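\[
\widehat{\Theta} \;\in\; \arg\min_{\Theta \in \Omega'_\alpha} \; \mathcal{L}(\Theta) + \lambda\, |||\Theta|||_{\mathrm{nuc}} , \tag{15}
\]

where the (negative) log likelihood function according to (13) is

\[
\mathcal{L}(\Theta) \;=\; -\frac{1}{n} \sum_{i=1}^{n} \Big( \langle\langle \Theta, e_{u_i} e_{v_i}^T \rangle\rangle - \log \Big( \sum_{j_1 \in S_i,\, j_2 \in T_i} \exp\big( \langle\langle \Theta, e_{j_1} e_{j_2}^T \rangle\rangle \big) \Big) \Big) , \text{ and} \tag{16}
\]

\[
\Omega'_\alpha \;\equiv\; \Big\{ A \in \mathbb{R}^{d_1 \times d_2} \;\Big|\; |||A|||_\infty \le \alpha , \text{ and } \sum_{j_1 \in [d_1],\, j_2 \in [d_2]} A_{j_1, j_2} = 0 \Big\} . \tag{17}
\]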

Compared to collaborative ranking, (a) rows and columns of $\Theta^*$ correspond to an alternative
from the first and second category, respectively; (b) each sample corresponds to the purchase choice
of a user, which follows the MNL model with $\Theta^*$; (c) each person is presented subsets $S_i$ and $T_i$
of items from each category; (d) each sampled data point represents the most preferred bundled pair of
alternatives.
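The bundled choice model can be illustrated with a few lines of code. The sketch below is not from the paper: the function names and the numerical matrix are hypothetical, items are indexed from 0, and the presented sets are assumed to contain distinct items. It computes the choice probabilities over all presented pairs and the negative log likelihood of a purchase history.

```python
import math


def bundle_choice_probabilities(theta, s, t):
    """MNL probability of each pair (j1, j2) with j1 in s and j2 in t."""
    weights = {(j1, j2): math.exp(theta[j1][j2]) for j1 in s for j2 in t}
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


def bundle_neg_log_likelihood(theta, history):
    """Average negative log likelihood; history holds tuples (u, v, S, T)."""
    total = 0.0
    for u, v, s, t in history:
        log_norm = math.log(sum(math.exp(theta[j1][j2]) for j1 in s for j2 in t))
        total += theta[u][v] - log_norm
    return -total / len(history)


if __name__ == "__main__":
    theta = [[0.0, 1.0], [2.0, -1.0], [0.5, 0.5]]   # hypothetical 3 x 2 preference matrix
    print(bundle_choice_probabilities(theta, [0, 2], [0, 1]))
    print(bundle_neg_log_likelihood(theta, [(0, 1, [0, 2], [0, 1]), (2, 0, [0, 2], [0, 1])]))
```

Adding the same constant to every entry of the matrix leaves the probabilities unchanged, which is the shift invariance that motivates restricting attention to matrices whose entries sum to zero.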


4.1 Performance guarantee 

We provide an upper bound on the error achieved by our convex relaxation, when the multiset
of alternatives $S_i$ from the first category and $T_i$ from the second category are drawn uniformly
at random with replacement from $[d_1]$ and $[d_2]$ respectively. Precisely, for given $k_1$ and $k_2$, we
let $S_i = \{j^{(i)}_{1,1}, \ldots, j^{(i)}_{1,k_1}\}$ and $T_i = \{j^{(i)}_{2,1}, \ldots, j^{(i)}_{2,k_2}\}$, where the $j^{(i)}_{1,\ell}$'s and $j^{(i)}_{2,\ell}$'s are independently drawn
uniformly at random over the $d_1$ and $d_2$ alternatives, respectively. Similar to the previous section,
this sampling with replacement is necessary for the analysis. Define

\[
\lambda_1 \;=\; \sqrt{ \frac{ e^{2\alpha} \max\{d_1, d_2\} \log d }{ n\, d_1 d_2 } } . \tag{18}
\]


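Theorem 3. Under the described sampling model, assume $16 e^{2\alpha} \min\{d_1, d_2\} \log d \le n \le \min\{d^5, k_1 k_2 \max\{d_1^2, d_2^2\}\} \log d$,
and $\lambda \in [8\lambda_1, c_1\lambda_1]$ with any constant $c_1 = O(1)$ larger than $\max\{8, 128/\sqrt{\min\{k_1, k_2\}}\}$. Then,
solving the optimization (15) achieves

\[
\frac{1}{d_1 d_2} |||\widehat{\Theta} - \Theta^*|||_{\mathrm{F}}^2 \;\le\; 48\sqrt{2}\, e^{2\alpha} c_1 \lambda_1 \sqrt{r}\, |||\widehat{\Theta} - \Theta^*|||_{\mathrm{F}} \;+\; 48\, e^{2\alpha} c_1 \lambda_1 \sum_{j=r+1}^{\min\{d_1, d_2\}} \sigma_j(\Theta^*) , \tag{19}
\]

for any $r \in \{1, \ldots, \min\{d_1, d_2\}\}$ with probability at least $1 - 2d^{-3}$ where $d = (d_1 + d_2)/2$.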

A proof is provided in Appendix D. Optimizing over $r$ gives the following corollaries.

Corollary 4.1 (Exact low-rank matrices). Suppose $\Theta^*$ has rank at most $r$. Under the hypotheses
of Theorem 3, solving the optimization (15) with the choice of the regularization parameter
$\lambda \in [8\lambda_1, c_1\lambda_1]$ achieves with probability at least $1 - 2d^{-3}$,

\[
\frac{1}{\sqrt{d_1 d_2}} |||\widehat{\Theta} - \Theta^*|||_{\mathrm{F}} \;\le\; 48\sqrt{2}\, e^{3\alpha} c_1 \sqrt{ \frac{ r (d_1 + d_2) \log d }{ n } } . \tag{20}
\]

This corollary shows that the number of samples $n$ needs to scale as $O(r(d_1 + d_2) \log d)$ in order
to achieve an arbitrarily small error. This is only a logarithmic factor larger than the degrees of
freedom. We provide a fundamental lower bound on the error, that matches the upper bound up
to a logarithmic factor. For approximately low-rank matrices in an $\ell_q$-ball as defined in (10), we
show an upper bound on the error, whose error exponent reduces from one to $(2 - q)/2$.

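Corollary 4.2 (Approximately low-rank matrices). Suppose $\Theta^* \in \mathbb{B}_q(\rho_q)$ for some $q \in (0, 1]$
and $\rho_q > 0$. Under the hypotheses of Theorem 3, solving the optimization (15) with the choice of
the regularization parameter $\lambda \in [8\lambda_1, c_1\lambda_1]$ achieves with probability at least $1 - 2d^{-3}$,

\[
\frac{1}{\sqrt{d_1 d_2}} |||\widehat{\Theta} - \Theta^*|||_{\mathrm{F}} \;\le\; \frac{2 \sqrt{\rho_q}}{\sqrt{d_1 d_2}} \left( 48 \sqrt{2}\, c_1 e^{3\alpha} \sqrt{ \frac{ d_1 d_2 \,(d_1 + d_2) \log d }{ n } } \right)^{\frac{2-q}{2}} . \tag{21}
\]

Since the proof is almost identical to the proof of Corollary 3.2 in Appendix B, we omit it.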

Theorem 4. Suppose $\Theta^*$ has rank $r$. Under the described sampling model, there is a universal
constant $c > 0$ such that

\[
\inf_{\widehat{\Theta}} \sup_{\Theta^* \in \Omega'_\alpha} \mathbb{E}\Big[ \frac{1}{\sqrt{d_1 d_2}} |||\widehat{\Theta} - \Theta^*|||_{\mathrm{F}} \Big] \;\ge\; c \min\Big\{ \sqrt{ \frac{ e^{-5\alpha}\, r\, (d_1 + d_2) }{ n } } ,\; \frac{ \alpha (d_1 + d_2) }{ \sqrt{d_1 d_2 \log d} } \Big\} , \tag{22}
\]

where the infimum is taken over all measurable functions over the observed purchase history $\{(u_i, v_i, S_i, T_i)\}_{i \in [n]}$.

A proof is provided in Appendix E.1. The first term is the dominant term, and when the sample
size is comparable to the latent dimension of the problem, this theorem establishes that Theorem
3 is minimax optimal up to a logarithmic factor.
5 Discussion


We list remaining challenges for future research. (a) Nuclear norm minimization, while polynomial-time,
is still slow. We want first-order methods that are efficient with provable guarantees. The
main challenge is providing a good initialization to start such non-convex approaches. (b) For
simpler models, such as the PL model, more general sampling over a graph has been studied. We
want analytical results for more general sampling.
Appendix

A Proof of Theorem 1


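We first introduce some additional notations used in the proof. Recall that $\mathcal{L}(\Theta)$ is the (negative) log likelihood
function. Let $\nabla \mathcal{L}(\Theta) \in \mathbb{R}^{d_1 \times d_2}$ denote its gradient such that $\nabla_{ij} \mathcal{L}(\Theta) = \partial \mathcal{L}(\Theta) / \partial \Theta_{ij}$. Let $\nabla^2 \mathcal{L}(\Theta) \in
\mathbb{R}^{d_1 d_2 \times d_1 d_2}$ denote its Hessian matrix such that $\nabla^2_{ij, i'j'} \mathcal{L}(\Theta) = \partial^2 \mathcal{L}(\Theta) / (\partial \Theta_{ij} \partial \Theta_{i'j'})$. By the definition of $\mathcal{L}(\Theta)$
in (4), we have

\[
\nabla \mathcal{L}(\Theta^*) \;=\; -\frac{1}{k\, d_1} \sum_{i=1}^{d_1} \sum_{\ell=1}^{k} e_i \big( e_{v_{i,\ell}} - p_{i,\ell} \big)^T , \tag{23}
\]

where $p_{i,\ell}$ denotes the conditional choice probability at the $\ell$-th position. Precisely, $p_{i,\ell} = \sum_{j \in S_{i,\ell}} p_{j|(i,\ell)}\, e_j$,
where $p_{j|(i,\ell)}$ is the probability that item $j$ is chosen at the $\ell$-th position from the top by the user $i$ conditioned
on the top $\ell - 1$ choices, such that $p_{j|(i,\ell)} \equiv \mathbb{P}\{v_{i,\ell} = j \mid v_{i,1}, \ldots, v_{i,\ell-1}, S_i\} = e^{\Theta^*_{ij}} / \big( \sum_{j' \in S_{i,\ell}} e^{\Theta^*_{ij'}} \big)$
and $S_{i,\ell} \equiv S_i \setminus \{v_{i,1}, \ldots, v_{i,\ell-1}\}$, where $S_i$ is the set of alternatives presented to the $i$-th user and
$v_{i,\ell}$ is the item ranked at the $\ell$-th position by the user $i$. Notice that for $i \neq i'$, $\partial^2 \mathcal{L}(\Theta) / (\partial \Theta_{ij} \partial \Theta_{i'j'}) = 0$ and
the Hessian is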


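\[
\frac{\partial^2 \mathcal{L}(\Theta)}{\partial \Theta_{ij}\, \partial \Theta_{ij'}} \;=\; \frac{1}{k\, d_1} \sum_{\ell=1}^{k} \mathbb{I}(j \in S_{i,\ell}) \big( p_{j|(i,\ell)}\, \mathbb{I}(j = j') - p_{j|(i,\ell)}\, p_{j'|(i,\ell)} \big) . \tag{24}
\]

This Hessian matrix is a block-diagonal matrix $\nabla^2 \mathcal{L}(\Theta) = \mathrm{diag}(H^{(1)}(\Theta), \ldots, H^{(d_1)}(\Theta))$ with

\[
H^{(i)}(\Theta) \;=\; \frac{1}{k\, d_1} \sum_{\ell=1}^{k} \big( \mathrm{diag}(p_{i,\ell}) - p_{i,\ell}\, p_{i,\ell}^T \big) . \tag{25}
\]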
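The gradient expression can be turned into one step of a proximal gradient method for the nuclear norm regularized problem. The sketch below is illustrative and is not the implementation used for the experiments: NumPy is assumed, the function names and step size are hypothetical, and the $\ell_\infty$ constraint is not enforced. It computes the gradient of the negative log likelihood at a given matrix and then soft-thresholds the singular values, which is the proximal operator of the nuclear norm.

```python
import numpy as np


def nll_gradient(theta, rankings):
    """Gradient of the negative log likelihood at theta.

    rankings[i] lists the k items ranked by user i, best first (repeats allowed).
    """
    d1, k = len(rankings), len(rankings[0])
    grad = np.zeros_like(theta, dtype=float)
    for i, ranking in enumerate(rankings):
        for pos in range(k):
            remaining = list(ranking[pos:])
            weights = np.exp(theta[i, remaining] - theta[i, remaining].max())
            probs = weights / weights.sum()        # conditional choice probabilities
            grad[i, ranking[pos]] -= 1.0           # indicator of the chosen item
            np.add.at(grad[i], remaining, probs)   # accumulate, also for repeated items
    return grad / (k * d1)


def svt(matrix, tau):
    """Soft-threshold the singular values by tau (prox of tau * nuclear norm)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u * np.maximum(s - tau, 0.0)) @ vt


def proximal_gradient_step(theta, rankings, lam, step):
    """Gradient step on the likelihood term followed by singular value thresholding."""
    return svt(theta - step * nll_gradient(theta, rankings), step * lam)
```

Each row of the gradient sums to zero, and singular value thresholding preserves zero row sums, so iterates started from a row-centered matrix stay row-centered without an explicit projection.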


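Let $\Delta = \Theta^* - \widehat{\Theta}$ where $\widehat{\Theta}$ is the optimal solution of the convex program in (3). We first introduce
three key technical lemmas. The first lemma follows from Lemma 1 of [24], and shows that $\Delta$ is
approximately low-rank.

Lemma A.1. If $\lambda \ge 2 |||\nabla \mathcal{L}(\Theta^*)|||_2$, then we have

\[
|||\Delta|||_{\mathrm{nuc}} \;\le\; 4\sqrt{2\rho}\, |||\Delta|||_{\mathrm{F}} \;+\; 4 \sum_{j=\rho+1}^{\min\{d_1, d_2\}} \sigma_j(\Theta^*) , \tag{26}
\]

for all $\rho \in [\min\{d_1, d_2\}]$.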

The following lemma provides a bound on the gradient using the concentration of measure for
sums of independent random matrices [?].

Lemma A.2. For any positive constant $c \ge 1$ and $\log d \ge 4(1 + c)/9$, with probability at least
$1 - 2d^{-c}$,

\[
|||\nabla \mathcal{L}(\Theta^*)|||_2 \;\le\; \sqrt{ \frac{4(1 + c) \log d}{k\, d_1^2} } \, \max\Big\{ \sqrt{d_1 / d_2} ,\; e^{2\alpha} \sqrt{4(1 + c) \log d} \Big\} . \tag{27}
\]

Since we are typically interested in the regime where the number of samples is much smaller than the dimension $d_1 \times d_2$ of the problem, the Hessian is typically not positive definite. However, when we restrict our attention to the vectorized $\Delta$ with relatively small nuclear norm, we can prove restricted strong convexity, which gives the following bound.

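Lemma A.3 (Restricted Strong Convexity for collaborative ranking). Fix any $\Theta \in \Omega_\alpha$ and assume $24 \leq k \leq \min\{d_1^2, (d_1^2 + d_2^2)/(2d_1)\}\log d$. Under the random sampling model of the alternatives $\{j_{i,\ell}\}_{i \in [d_1], \ell \in [k]}$ and the random outcome of the comparisons described in the main text, with probability larger than $1 - 2d^{-2^{18}}$,

$$\mathrm{Vec}(\Delta)^T\,\nabla^2\mathcal{L}(\Theta)\,\mathrm{Vec}(\Delta) \geq \frac{e^{-4\alpha}}{24\,d_1 d_2}\,|||\Delta|||_{\mathrm{F}}^2 , \tag{28}$$

for all $\Delta$ in $\mathcal{A}$ where

$$\mathcal{A} = \Big\{\Delta \in \mathbb{R}^{d_1 \times d_2} \;\Big|\; |||\Delta|||_\infty \leq 2\alpha,\ \sum_{j \in [d_2]} \Delta_{ij} = 0 \text{ for all } i \in [d_1], \text{ and } |||\Delta|||_{\mathrm{F}}^2 \geq \mu\,|||\Delta|||_{\mathrm{nuc}}\Big\} , \tag{29}$$

with

$$\mu \equiv 2^{10}\,e^{2\alpha}\,\alpha\,d_2\,\sqrt{\frac{d_1 \log d}{k\,\min\{d_1, d_2\}}} . \tag{30}$$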


Building on these lemmas, the proof of Theorem 1 is divided into the following two cases. In both cases, we will show that

$$|||\Delta|||_{\mathrm{F}}^2 \leq 72\,e^{4\alpha}\,c_0\lambda_0\,d_1 d_2\,|||\Delta|||_{\mathrm{nuc}} , \tag{31}$$

with high probability. Applying Lemma A.1 proves the desired theorem. We are left to show that Eq. (31) holds.


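Case 1: Suppose $|||\Delta|||_{\mathrm{F}}^2 \geq \mu\,|||\Delta|||_{\mathrm{nuc}}$. With $\Delta = \Theta^* - \hat{\Theta}$, the Taylor expansion yields

$$\mathcal{L}(\hat{\Theta}) = \mathcal{L}(\Theta^*) - \langle\langle \nabla\mathcal{L}(\Theta^*), \Delta \rangle\rangle + \frac{1}{2}\,\mathrm{Vec}(\Delta)^T\,\nabla^2\mathcal{L}(\Theta)\,\mathrm{Vec}(\Delta) , \tag{32}$$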
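where $\Theta = a\hat{\Theta} + (1-a)\Theta^*$ for some $a \in [0,1]$. It follows from Lemma A.3 that with probability at least $1 - 2d^{-2^{18}}$,

$$\mathcal{L}(\hat{\Theta}) - \mathcal{L}(\Theta^*) \geq -\langle\langle \nabla\mathcal{L}(\Theta^*), \Delta \rangle\rangle + \frac{e^{-4\alpha}}{48\,d_1 d_2}\,|||\Delta|||_{\mathrm{F}}^2 \geq -|||\nabla\mathcal{L}(\Theta^*)|||_2\,|||\Delta|||_{\mathrm{nuc}} + \frac{e^{-4\alpha}}{48\,d_1 d_2}\,|||\Delta|||_{\mathrm{F}}^2 .$$

From the definition of $\hat{\Theta}$ as an optimal solution of the minimization, we have

$$\mathcal{L}(\hat{\Theta}) - \mathcal{L}(\Theta^*) \leq \lambda\big(|||\Theta^*|||_{\mathrm{nuc}} - |||\hat{\Theta}|||_{\mathrm{nuc}}\big) \leq \lambda\,|||\Delta|||_{\mathrm{nuc}} .$$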

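By the assumption, we choose $\lambda \geq 32\lambda_0$. In view of Lemma A.2, this implies that $\lambda \geq 2|||\nabla\mathcal{L}(\Theta^*)|||_2$ with probability at least $1 - 2d^{-3}$. It follows that with probability at least $1 - 2d^{-3} - 2d^{-2^{18}}$,

$$\frac{e^{-4\alpha}}{48\,d_1 d_2}\,|||\Delta|||_{\mathrm{F}}^2 \leq \big(\lambda + |||\nabla\mathcal{L}(\Theta^*)|||_2\big)\,|||\Delta|||_{\mathrm{nuc}} \leq \frac{3\lambda}{2}\,|||\Delta|||_{\mathrm{nuc}} .$$

By our assumption on $\lambda \leq c_0\lambda_0$, this proves the desired bound in Eq. (31).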

Case 2: Suppose $|||\Delta|||_{\mathrm{F}}^2 \leq \mu\,|||\Delta|||_{\mathrm{nuc}}$. By the definition of $\mu$ and the fact that $c_0 \geq 32$, it follows that $\mu \leq 72\,e^{4\alpha}c_0\lambda_0\,d_1 d_2$, and we get the same bound as in Eq. (31).


A.1 Proof of Lemma A.1

Denote the singular value decomposition of $\Theta^*$ by $\Theta^* = U\Sigma V^T$, where $U \in \mathbb{R}^{d_1 \times d_1}$ and $V \in \mathbb{R}^{d_2 \times d_2}$ are orthogonal matrices. For a given $r \in [\min\{d_1, d_2\}]$, let $U_r = [u_1, \ldots, u_r]$ and $V_r = [v_1, \ldots, v_r]$, where $u_i \in \mathbb{R}^{d_1 \times 1}$ and $v_i \in \mathbb{R}^{d_2 \times 1}$ are the left and right singular vectors corresponding to the $i$-th largest singular value, respectively. Define $T$ to be the subspace spanned by all matrices in $\mathbb{R}^{d_1 \times d_2}$ of the form $U_r A^T$ or $B V_r^T$ for any $A \in \mathbb{R}^{d_2 \times r}$ or $B \in \mathbb{R}^{d_1 \times r}$, respectively. The orthogonal projection of any matrix $M \in \mathbb{R}^{d_1 \times d_2}$ onto the space $T$ is given by $\mathcal{P}_T(M) = U_r U_r^T M + M V_r V_r^T - U_r U_r^T M V_r V_r^T$. The projection of $M$ onto the complement space $T^\perp$ is $\mathcal{P}_{T^\perp}(M) = (I - U_r U_r^T) M (I - V_r V_r^T)$. The subspace $T$ and the respective projections onto $T$ and $T^\perp$ play a crucial role in the analysis of nuclear norm minimization, since they define the sub-gradient of the nuclear norm at $\Theta^*$. We refer to [23] for a more detailed treatment of this topic.

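Let $\Delta' = \mathcal{P}_T(\Delta)$ and $\Delta'' = \mathcal{P}_{T^\perp}(\Delta)$. Notice that $\mathcal{P}_T(\Theta^*) = U_r \Sigma_r V_r^T$, where $\Sigma_r \in \mathbb{R}^{r \times r}$ is the diagonal matrix formed by the top $r$ singular values. Since $\mathcal{P}_T(\Theta^*)$ and $\Delta''$ have row and column spaces that are orthogonal, it follows from Lemma 2.3 in [22] that

$$|||\mathcal{P}_T(\Theta^*) - \Delta''|||_{\mathrm{nuc}} = |||\mathcal{P}_T(\Theta^*)|||_{\mathrm{nuc}} + |||\Delta''|||_{\mathrm{nuc}} .$$

Hence, in view of the triangle inequality,

$$\begin{aligned}
|||\hat{\Theta}|||_{\mathrm{nuc}} &= |||\mathcal{P}_T(\Theta^*) + \mathcal{P}_{T^\perp}(\Theta^*) - \Delta' - \Delta''|||_{\mathrm{nuc}} \\
&\geq |||\mathcal{P}_T(\Theta^*) - \Delta''|||_{\mathrm{nuc}} - |||\mathcal{P}_{T^\perp}(\Theta^*) - \Delta'|||_{\mathrm{nuc}} \\
&= |||\mathcal{P}_T(\Theta^*)|||_{\mathrm{nuc}} + |||\Delta''|||_{\mathrm{nuc}} - |||\mathcal{P}_{T^\perp}(\Theta^*) - \Delta'|||_{\mathrm{nuc}} \\
&\geq |||\mathcal{P}_T(\Theta^*)|||_{\mathrm{nuc}} + |||\Delta''|||_{\mathrm{nuc}} - |||\mathcal{P}_{T^\perp}(\Theta^*)|||_{\mathrm{nuc}} - |||\Delta'|||_{\mathrm{nuc}} \\
&= |||\Theta^*|||_{\mathrm{nuc}} + |||\Delta''|||_{\mathrm{nuc}} - 2|||\mathcal{P}_{T^\perp}(\Theta^*)|||_{\mathrm{nuc}} - |||\Delta'|||_{\mathrm{nuc}} .
\end{aligned} \tag{33}$$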


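Because $\hat{\Theta}$ is an optimal solution, we have

$$\lambda\big(|||\hat{\Theta}|||_{\mathrm{nuc}} - |||\Theta^*|||_{\mathrm{nuc}}\big) \leq \mathcal{L}(\Theta^*) - \mathcal{L}(\hat{\Theta}) \overset{(a)}{\leq} \langle\langle \Delta, \nabla\mathcal{L}(\Theta^*) \rangle\rangle \overset{(b)}{\leq} |||\Delta|||_{\mathrm{nuc}}\,|||\nabla\mathcal{L}(\Theta^*)|||_2 \leq \frac{\lambda}{2}\,|||\Delta|||_{\mathrm{nuc}} , \tag{34}$$

where (a) holds due to the convexity of $\mathcal{L}$; (b) follows from the Cauchy-Schwarz inequality; the last inequality holds due to the assumption that $\lambda \geq 2|||\nabla\mathcal{L}(\Theta^*)|||_2$. Combining (33) and (34) yields

$$2\big(|||\Delta''|||_{\mathrm{nuc}} - 2|||\mathcal{P}_{T^\perp}(\Theta^*)|||_{\mathrm{nuc}} - |||\Delta'|||_{\mathrm{nuc}}\big) \leq |||\Delta|||_{\mathrm{nuc}} \leq |||\Delta'|||_{\mathrm{nuc}} + |||\Delta''|||_{\mathrm{nuc}} .$$

Thus $|||\Delta''|||_{\mathrm{nuc}} \leq 3|||\Delta'|||_{\mathrm{nuc}} + 4|||\mathcal{P}_{T^\perp}(\Theta^*)|||_{\mathrm{nuc}}$. By the triangle inequality,

$$|||\Delta|||_{\mathrm{nuc}} \leq 4|||\Delta'|||_{\mathrm{nuc}} + 4|||\mathcal{P}_{T^\perp}(\Theta^*)|||_{\mathrm{nuc}} .$$

Notice that $\Delta' = U_r U_r^T \Delta + (I - U_r U_r^T)\Delta V_r V_r^T$. Both $U_r U_r^T \Delta$ and $(I - U_r U_r^T)\Delta V_r V_r^T$ have rank at most $r$. Thus $\Delta'$ has rank at most $2r$. Hence, $|||\Delta'|||_{\mathrm{nuc}} \leq \sqrt{2r}\,|||\Delta'|||_{\mathrm{F}} \leq \sqrt{2r}\,|||\Delta|||_{\mathrm{F}}$. Then the lemma follows because $|||\mathcal{P}_{T^\perp}(\Theta^*)|||_{\mathrm{nuc}} = \sum_{j=r+1}^{\min\{d_1,d_2\}} \sigma_j(\Theta^*)$.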


A.2 Proof of Lemma A.2

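Define $X_i = -\frac{1}{k\,d_1}\, e_i \sum_{\ell=1}^{k} (e_{v_{i,\ell}} - p_{i,\ell})^T$ such that $\nabla\mathcal{L}(\Theta^*) = \sum_{i=1}^{d_1} X_i$, which is a sum of $d_1$ independent random matrices. Note that since $p_{i,\ell}$ has $(k + 1 - \ell)$ non-zero entries, each bounded in absolute value by $e^{2\alpha}/(k + 1 - \ell)$, we have the following bound deterministically: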


IA 


UII2 


Pi,e) 


e=i 


< 


Vk+Y, 


\Pii 


i=i 


1 


< Vk + e^'^Y-^ _ 

Vk + l-i 

< y/k -|- e^"2(VA: -1-1 — 1) 

< 3e2“\/fe , 


and 


[AgA,^ 


< 9e^°‘k 


di 

E 

i=l 


E [i 


e,:et 


= 9e 


Aa 


ixdilll2 


= 9e‘^°k, 
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and 


d\ d\ k 

'^E[Xj'Xi] {ei,£ - Pi,i){ev^t, - Pi/'? 

i=l i=l e/'=l 

di k 

= X] X] “ Pi,i)ievi^e - Pi,if] 

i=l £=1 
di k 


= EE(^ 

i=l i=l 
di k 

=?EEe 

i=l i=l 
di 






- E [Pi/pf /]) 


Therefore, 


2=1 


f;iE[xfXi 

2=1 


< 


kdi 

It? 


By the matrix Bernstein inequality [27],

f(||| V£(e-)fc > *) < (* + rf.) exp ' 

which gives the desired tail probability of $2d^{-c}$ for the choice of

J I 4(1 + c) logd 4(1 + c)e^" logd 1 
|Y/cdi min{(i2,c^i/(9e^")} ’ k^/'^ di J 

= e^^V4(l + c)logd| , 

where the last equality holds due to the assumption that logd > 4(1 + c)/9. 

A.3 Proof of Lemma A.3

Recall that the Hessian matrix is a block-diagonal matrix with the i-th block given by 

(25). We use the following remark from [T^ to bound the Hessian. 

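Remark A.4 (Claim 1 of the reference cited above). Given $\theta \in \mathbb{R}^{\rho}$, let $p$ be the column probability vector with $p_i = e^{\theta_i}/(e^{\theta_1} + \cdots + e^{\theta_\rho})$ for each $i \in [\rho]$ and for any positive integer $\rho$. If $|\theta_i| \leq \alpha$ for all $i \in [\rho]$, then

$$e^{2\alpha}\big(\mathrm{diag}(p) - p\,p^T\big) \succeq \frac{1}{\rho}\,\mathrm{diag}(\mathbb{1}) - \frac{1}{\rho^2}\,\mathbb{1}\mathbb{1}^T .$$

By letting $\mathbb{1}_{S_{i,\ell}} = \sum_{j \in S_{i,\ell}} e_j$ and applying the above claim, we have

$$H^{(i)}(\Theta) \succeq \frac{e^{-2\alpha}}{k\,d_1}\sum_{\ell=1}^{k} \frac{1}{2(k-\ell+1)^2}\sum_{j,j' \in S_{i,\ell}} (e_j - e_{j'})(e_j - e_{j'})^T \succeq \frac{e^{-2\alpha}}{2k^3 d_1}\sum_{\ell=1}^{k}\sum_{j,j' \in S_{i,\ell}} (e_j - e_{j'})(e_j - e_{j'})^T .$$

Hence,

$$\mathrm{Vec}(\Delta)^T\,\nabla^2\mathcal{L}(\Theta)\,\mathrm{Vec}(\Delta) = \sum_{i=1}^{d_1} (\Delta^T e_i)^T H^{(i)}(\Theta)(\Delta^T e_i) \geq \frac{e^{-2\alpha}}{2k^3 d_1}\sum_{i=1}^{d_1}\sum_{\ell=1}^{k}\sum_{j,j' \in S_{i,\ell}} \langle\langle \Delta, e_i(e_j - e_{j'})^T \rangle\rangle^2 .$$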
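The matrix inequality in Remark A.4 can be spot-checked numerically through its quadratic form, since $x^T(\mathrm{diag}(p) - pp^T)x$ is the variance of $x$ under $p$. The sketch below is only a sanity check on random inputs, not a proof; the helper names `variance_form` and `check_remark` and the sampling choices are assumptions made for this example.

```python
import math
import random


def variance_form(p, x):
    """x^T (diag(p) - p p^T) x, i.e. the variance of x under the weights p."""
    mean = sum(pi * xi for pi, xi in zip(p, x))
    return sum(pi * xi * xi for pi, xi in zip(p, x)) - mean * mean


def check_remark(rho, alpha, trials, seed=0):
    """Smallest observed gap e^{2 alpha} Var_p(x) - Var_uniform(x) over random draws."""
    rng = random.Random(seed)
    uniform = [1.0 / rho] * rho
    worst = float("inf")
    for _ in range(trials):
        theta = [rng.uniform(-alpha, alpha) for _ in range(rho)]
        total = sum(math.exp(t) for t in theta)
        p = [math.exp(t) / total for t in theta]
        x = [rng.gauss(0.0, 1.0) for _ in range(rho)]
        gap = math.exp(2 * alpha) * variance_form(p, x) - variance_form(uniform, x)
        worst = min(worst, gap)
    return worst
```

A non-negative return value (up to rounding) is consistent with the remark; the underlying reason is that every $p_i$ is at least $e^{-2\alpha}/\rho$ when $|\theta_i| \leq \alpha$.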

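By changing the order of the summation, we get that

$$\sum_{\ell=1}^{k}\sum_{j,j' \in S_{i,\ell}} \langle\langle \Delta, e_i(e_j - e_{j'})^T \rangle\rangle^2 = \sum_{\ell,\ell'=1}^{k} \langle\langle \Delta, e_i(e_{j_{i,\ell}} - e_{j_{i,\ell'}})^T \rangle\rangle^2 \sum_{\ell''=1}^{k} \mathbb{I}\big(\sigma_i(j_{i,\ell''}) \leq \min\{\sigma_i(j_{i,\ell}), \sigma_i(j_{i,\ell'})\}\big) .$$

Define

$$\chi_{i,\ell,\ell',\ell''} \equiv \mathbb{I}\big(\sigma_i(j_{i,\ell''}) \leq \min\{\sigma_i(j_{i,\ell}), \sigma_i(j_{i,\ell'})\}\big) ,$$

and let

$$H(\Delta) \equiv \frac{e^{-2\alpha}}{2k^3 d_1}\sum_{i=1}^{d_1}\sum_{\ell,\ell'=1}^{k} \langle\langle \Delta, e_i(e_{j_{i,\ell}} - e_{j_{i,\ell'}})^T \rangle\rangle^2 \sum_{\ell''=1}^{k} \chi_{i,\ell,\ell',\ell''} . \tag{35}$$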

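Then we have $\mathrm{Vec}(\Delta)^T\,\nabla^2\mathcal{L}(\Theta)\,\mathrm{Vec}(\Delta) \geq H(\Delta)$. To prove the lemma, it suffices to bound $H(\Delta)$ from below. First, we prove a lower bound on the expectation $\mathbb{E}[H(\Delta)]$. Notice that for $\ell \neq \ell'$, the conditional expectation of $\sum_{\ell''=1}^{k}\chi_{i,\ell,\ell',\ell''}$ given the set of alternatives presented to user $i$ is

$$\mathbb{E}\Big[\sum_{\ell''=1}^{k} \chi_{i,\ell,\ell',\ell''} \,\Big|\, j_{i,1},\ldots,j_{i,k}\Big] = 1 + \sum_{\ell'' \in [k]\setminus\{\ell,\ell'\}} \frac{\exp(\Theta_{i j_{i,\ell''}})}{\exp(\Theta_{i j_{i,\ell}}) + \exp(\Theta_{i j_{i,\ell'}}) + \exp(\Theta_{i j_{i,\ell''}})} \geq 1 + \frac{k-2}{1 + 2e^{2\alpha}} \geq \frac{k}{3e^{2\alpha}} .$$

Then,

$$\begin{aligned}
\mathbb{E}[H(\Delta)] &= \frac{e^{-2\alpha}}{2k^3 d_1}\sum_{i=1}^{d_1}\sum_{\ell,\ell' \in [k]} \mathbb{E}\Big[\langle\langle \Delta, e_i(e_{j_{i,\ell}} - e_{j_{i,\ell'}})^T \rangle\rangle^2\, \mathbb{E}\Big[\sum_{\ell''=1}^{k}\chi_{i,\ell,\ell',\ell''} \,\Big|\, j_{i,1},\ldots,j_{i,k}\Big]\Big] \\
&\geq \frac{e^{-4\alpha}}{6k^2 d_1}\sum_{i=1}^{d_1}\sum_{\ell \neq \ell' \in [k]} \mathbb{E}\Big[\langle\langle \Delta, e_i(e_{j_{i,\ell}} - e_{j_{i,\ell'}})^T \rangle\rangle^2\Big] \\
&= \frac{e^{-4\alpha}}{6k^2 d_1}\sum_{i=1}^{d_1}\sum_{\ell \neq \ell' \in [k]} \Big(\frac{2}{d_2}\sum_{j=1}^{d_2}\Delta_{ij}^2 - \frac{2}{d_2^2}\Big(\sum_{j=1}^{d_2}\Delta_{ij}\Big)^2\Big) \\
&= \frac{e^{-4\alpha}(k-1)}{3k\,d_1 d_2}\,|||\Delta|||_{\mathrm{F}}^2 ,
\end{aligned} \tag{36}$$

where the last equality holds because $\sum_{j \in [d_2]} \Delta_{ij} = 0$ for $\Delta \in \Omega_{2\alpha}$ and for all $i \in [d_1]$.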

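We are left to prove that $H(\Delta)$ cannot deviate from its mean too much. Suppose there exists a $\Delta \in \mathcal{A}$ such that Eq. (28) is violated, i.e. $H(\Delta) < (e^{-4\alpha}/(24\,d_1 d_2))\,|||\Delta|||_{\mathrm{F}}^2$. We will show this happens with a small probability. From Eq. (36), we get that for $k \geq 24$,

$$\mathbb{E}[H(\Delta)] - H(\Delta) \geq \frac{(20/3)\,e^{-4\alpha}}{24\,d_1 d_2}\,|||\Delta|||_{\mathrm{F}}^2 . \tag{37}$$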


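We use a peeling argument as in [24, Lemma 3], [28] to upper bound the probability that Eq. (37) is true. We first construct the following family of subsets to cover $\mathcal{A}$ such that $\mathcal{A} \subseteq \bigcup_{\ell=1}^{\infty} \mathcal{S}_\ell$. Recall $\mu = 2^{10}e^{2\alpha}\alpha\,d_2\sqrt{(d_1 \log d)/(k\min\{d_1, d_2\})}$, defined in (30). Notice that since for any $\Delta \in \mathcal{A}$, $|||\Delta|||_{\mathrm{F}}^2 \geq \mu|||\Delta|||_{\mathrm{nuc}} \geq \mu|||\Delta|||_{\mathrm{F}}$, it follows that $|||\Delta|||_{\mathrm{F}} \geq \mu$. Then, we can cover $\mathcal{A}$ with the family of sets

$$\mathcal{S}_\ell = \Big\{\Delta \in \mathbb{R}^{d_1 \times d_2} \;\Big|\; |||\Delta|||_\infty \leq 2\alpha,\ \beta^{\ell-1}\mu \leq |||\Delta|||_{\mathrm{F}} \leq \beta^{\ell}\mu,\ \sum_{j \in [d_2]}\Delta_{ij} = 0 \text{ for all } i \in [d_1], \text{ and } |||\Delta|||_{\mathrm{nuc}} \leq \beta^{2\ell}\mu\Big\} ,$$

where $\beta = \sqrt{10/9}$ and for $\ell \in \{1, 2, 3, \ldots\}$. This implies that when there exists a $\Delta \in \mathcal{A}$ such that (37) holds, then there exists an $\ell \in \mathbb{Z}_+$ such that $\Delta \in \mathcal{S}_\ell$ and

$$\mathbb{E}[H(\Delta)] - H(\Delta) \geq \frac{(20/3)\,e^{-4\alpha}}{24\,d_1 d_2}\,\beta^{2\ell-2}\mu^2 = \frac{e^{-4\alpha}}{4\,d_1 d_2}\,(\beta^{\ell}\mu)^2 . \tag{38}$$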

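Applying the union bound over $\ell \in \mathbb{Z}_+$, we get from (37) and (38) that

$$\begin{aligned}
\mathbb{P}\Big\{\exists \Delta \in \mathcal{A},\ H(\Delta) < \frac{e^{-4\alpha}}{24\,d_1 d_2}|||\Delta|||_{\mathrm{F}}^2\Big\} &\leq \sum_{\ell=1}^{\infty} \mathbb{P}\Big\{\sup_{\Delta \in \mathcal{S}_\ell}\big(\mathbb{E}[H(\Delta)] - H(\Delta)\big) \geq \frac{e^{-4\alpha}}{4\,d_1 d_2}(\beta^{\ell}\mu)^2\Big\} \\
&\leq \sum_{\ell=1}^{\infty} \mathbb{P}\Big\{\sup_{\Delta \in \mathcal{B}(\beta^{\ell}\mu)}\big(\mathbb{E}[H(\Delta)] - H(\Delta)\big) \geq \frac{e^{-4\alpha}}{4\,d_1 d_2}(\beta^{\ell}\mu)^2\Big\} ,
\end{aligned} \tag{39}$$

where we define a new set $\mathcal{B}(D)$ such that $\mathcal{S}_\ell \subseteq \mathcal{B}(\beta^{\ell}\mu)$:

$$\mathcal{B}(D) = \Big\{\Delta \in \mathbb{R}^{d_1 \times d_2} \;\Big|\; |||\Delta|||_\infty \leq 2\alpha,\ |||\Delta|||_{\mathrm{F}} \leq D,\ \sum_{j \in [d_2]}\Delta_{ij} = 0 \text{ for all } i \in [d_1],\ \mu|||\Delta|||_{\mathrm{nuc}} \leq D^2\Big\} . \tag{40}$$

The following key lemma provides the upper bound on this probability.

Lemma A.5. For $(16\min\{d_1, d_2\}\log d)/(3d_1) \leq k \leq d_1^2\log d$,

$$\mathbb{P}\Big\{\sup_{\Delta \in \mathcal{B}(D)}\big(\mathbb{E}[H(\Delta)] - H(\Delta)\big) \geq \frac{e^{-4\alpha}}{4\,d_1 d_2}D^2\Big\} \leq \exp\Big\{-\frac{e^{-4\alpha}\,k\,D^4}{2^{19}\alpha^4 d_1 d_2^2}\Big\} . \tag{41}$$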

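Let $\eta = \exp\Big(-\frac{e^{-4\alpha}\,4k(\beta - 1.002)\mu^4}{2^{19}\alpha^4 d_1 d_2^2}\Big)$. Applying the tail bound to (39), we get

$$\begin{aligned}
\mathbb{P}\Big\{\exists \Delta \in \mathcal{A},\ H(\Delta) < \frac{e^{-4\alpha}}{24\,d_1 d_2}|||\Delta|||_{\mathrm{F}}^2\Big\} &\leq \sum_{\ell=1}^{\infty} \exp\Big\{-\frac{e^{-4\alpha}\,k(\beta^{\ell}\mu)^4}{2^{19}\alpha^4 d_1 d_2^2}\Big\} \\
&\overset{(a)}{\leq} \sum_{\ell=1}^{\infty} \exp\Big\{-\frac{e^{-4\alpha}\,4k\,\ell(\beta - 1.002)\mu^4}{2^{19}\alpha^4 d_1 d_2^2}\Big\} \leq \frac{\eta}{1 - \eta} ,
\end{aligned}$$

where (a) holds because $\beta^{x} \geq x\log\beta \geq x(\beta - 1.002)$ for the choice of $\beta = \sqrt{10/9}$. By the definition of $\mu$,

$$\eta = \exp\Big\{-\frac{2^{23}\,e^{4\alpha}\,d_1 d_2^2\,(\log d)^2(\beta - 1.002)}{k\,\min\{d_1, d_2\}^2}\Big\} \leq \exp\{-2^{18}\log d\} ,$$

where the last inequality follows from the assumption that $k \leq \max\{d_1, d_2^2/d_1\}\log d = (d_2^2 d_1 \log d)/\min\{d_1, d_2\}^2$ and $\beta - 1.002 \geq 2^{-5}$. Since for $d \geq 2$, $\exp\{-2^{18}\log d\} \leq 1/2$ and thus $\eta \leq 1/2$, the lemma follows by assembling the last two displayed inequalities.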
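The numerical constants used in the peeling argument can be checked with a few lines of exact and floating-point arithmetic. This is a small sketch under the stated choice $\beta = \sqrt{10/9}$; the function name `peeling_constants` is an assumption made for this example.

```python
from fractions import Fraction
import math


def peeling_constants():
    """Return the shell constant, beta - 1.002, and log(beta) for beta = sqrt(10/9)."""
    beta_sq = Fraction(10, 9)
    # (20/3)/24 * beta^{-2}: the factor linking the bound on shell l to (beta^l mu)^2.
    shell = Fraction(20, 3) / 24 / beta_sq
    beta = math.sqrt(10 / 9)
    return shell, beta - 1.002, math.log(beta)


shell, gap, log_beta = peeling_constants()
print(shell)                      # exact rational value of the shell constant
print(gap >= 2 ** -5)             # the lower bound used on beta - 1.002
print(log_beta >= gap)            # log(beta) dominates beta - 1.002
print(2 ** 23 * 2 ** -5 == 2 ** 18)
```

The shell constant evaluates to exactly $1/4$, which is the factor appearing on the right-hand side of (38) and (39).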


A.4 Proof of Lemma A.5

Recall that

$$H(\Delta) = \frac{e^{-2\alpha}}{2k^3 d_1}\sum_{i=1}^{d_1}\sum_{\ell,\ell'=1}^{k} \langle\langle \Delta, e_i(e_{j_{i,\ell}} - e_{j_{i,\ell'}})^T \rangle\rangle^2 \sum_{\ell''=1}^{k} \chi_{i,\ell,\ell',\ell''} ,$$

with $\chi_{i,\ell,\ell',\ell''} = \mathbb{I}\big(\sigma_i(j_{i,\ell''}) \leq \min\{\sigma_i(j_{i,\ell}), \sigma_i(j_{i,\ell'})\}\big)$. Let $Z = \sup_{\Delta \in \mathcal{B}(D)} \big(\mathbb{E}[H(\Delta)] - H(\Delta)\big)$ be the worst-case random deviation of $H(\Delta)$ from its mean. We prove an upper bound on $Z$ by showing that $Z - \mathbb{E}[Z] \leq e^{-4\alpha}D^2/(64\,d_1 d_2)$ with high probability, and $\mathbb{E}[Z] \leq 9e^{-4\alpha}D^2/(40\,d_1 d_2)$. This proves the desired claim in Lemma A.5.

To prove the concentration of $Z$, we utilize the random utility model (RUM) theoretic interpretation of the MNL model. The random variable $Z$ depends on the random choice of alternatives $\{j_{i,\ell}\}_{i \in [d_1], \ell \in [k]}$ and the random $k$-wise ranking outcomes $\{\sigma_i\}_{i \in [d_1]}$. The random utility theory tells us that the $k$-wise ranking from the MNL model has the same distribution as first drawing independent (unobserved) utilities $u_{i,\ell}$ of the item $j_{i,\ell}$ for user $i$ according to the standard Gumbel Cumulative Distribution Function (CDF) $F(c - \Theta_{i,j_{i,\ell}})$ with $F(c) = e^{-e^{-c}}$, and then ranking the $k$ items for user $i$ according to their respective utilities. Given this definition of the MNL model, we have $\chi_{i,\ell,\ell',\ell''} = \mathbb{I}\big(u_{i,\ell''} \geq \max\{u_{i,\ell}, u_{i,\ell'}\}\big)$. Thus $Z$ is a function of independent choices of the items and their (unobserved) utilities, i.e. $Z = f(\{(j_{i,\ell}, u_{i,\ell})\}_{i \in [d_1], \ell \in [k]})$. Let $x_{i,\ell} = (j_{i,\ell}, u_{i,\ell})$ and write $H(\Delta)$ as $H(\Delta, \{x_{i,\ell}\}_{i \in [d_1], \ell \in [k]})$. This allows us to bound the difference and apply McDiarmid's tail bound. Note that for any $i \in [d_1]$, $\ell \in [k]$, $x_{1,1}, \ldots, x_{d_1,k}$ and $x'_{i,\ell}$,


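$$\begin{aligned}
&\big| f(x_{1,1}, \ldots, x_{i,\ell}, \ldots, x_{d_1,k}) - f(x_{1,1}, \ldots, x'_{i,\ell}, \ldots, x_{d_1,k}) \big| \\
&\quad= \Big| \sup_{\Delta \in \mathcal{B}(D)}\big(\mathbb{E}[H(\Delta)] - H(\Delta, x_{1,1}, \ldots, x_{i,\ell}, \ldots, x_{d_1,k})\big) - \sup_{\Delta \in \mathcal{B}(D)}\big(\mathbb{E}[H(\Delta)] - H(\Delta, x_{1,1}, \ldots, x'_{i,\ell}, \ldots, x_{d_1,k})\big) \Big| \\
&\quad\leq \sup_{\Delta \in \mathcal{B}(D)} \big| H(\Delta, x_{1,1}, \ldots, x_{i,\ell}, \ldots, x_{d_1,k}) - H(\Delta, x_{1,1}, \ldots, x'_{i,\ell}, \ldots, x_{d_1,k}) \big| \\
&\quad\overset{(a)}{\leq} \frac{e^{-2\alpha}}{2k^3 d_1} \sup_{\Delta \in \mathcal{B}(D)} \Big\{ 2\sum_{\ell' \in [k]\setminus\{\ell\}} \langle\langle \Delta, e_i(e_{j_{i,\ell}} - e_{j_{i,\ell'}})^T \rangle\rangle^2 \sum_{\ell''=1}^{k}\chi_{i,\ell,\ell',\ell''} + \sum_{\ell',\ell'' \in [k],\, \ell' \neq \ell''} \langle\langle \Delta, e_i(e_{j_{i,\ell'}} - e_{j_{i,\ell''}})^T \rangle\rangle^2 \chi_{i,\ell',\ell'',\ell} \Big\} \\
&\quad\overset{(b)}{\leq} \frac{8\alpha^2 e^{-2\alpha}}{k^3 d_1} \Big( 2\sum_{\ell' \in [k]\setminus\{\ell\}}\sum_{\ell''=1}^{k}\chi_{i,\ell,\ell',\ell''} + \sum_{\ell',\ell'' \in [k],\, \ell' \neq \ell''}\chi_{i,\ell',\ell'',\ell} \Big) \\
&\quad\leq \frac{16\alpha^2 e^{-2\alpha}}{k\,d_1} ,
\end{aligned}$$

where (a) follows because for a fixed $i$ and $\ell$, the random variable $x_{i,\ell} = (j_{i,\ell}, u_{i,\ell})$ can appear in three terms, i.e. $\sum_{\ell',\ell''}\langle\langle \Delta, e_i(e_{j_{i,\ell}} - e_{j_{i,\ell'}})^T \rangle\rangle^2\chi_{i,\ell,\ell',\ell''} + \sum_{\ell',\ell''}\langle\langle \Delta, e_i(e_{j_{i,\ell'}} - e_{j_{i,\ell}})^T \rangle\rangle^2\chi_{i,\ell',\ell,\ell''} + \sum_{\ell',\ell''}\langle\langle \Delta, e_i(e_{j_{i,\ell'}} - e_{j_{i,\ell''}})^T \rangle\rangle^2\chi_{i,\ell',\ell'',\ell}$, and (b) follows because $|\Delta_{ij}| \leq 2\alpha$ for all $i, j$ since $\Delta \in \mathcal{B}(D)$. The last inequality follows because in the worst case, $\sum_{\ell' \in [k]\setminus\{\ell\}}\sum_{\ell''=1}^{k}\chi_{i,\ell,\ell',\ell''} \leq k(k-1)/2$ and $\sum_{\ell',\ell'' \in [k],\, \ell' \neq \ell''}\chi_{i,\ell',\ell'',\ell} \leq k(k-1)$. This holds with equality if $\sigma_i(j_{i,\ell}) = k$ and $\sigma_i(j_{i,\ell}) = 1$, respectively. By the bounded differences inequality, we have

$$\mathbb{P}\{Z - \mathbb{E}[Z] \geq t\} \leq \exp\Big(-\frac{k\,d_1\,t^2}{128\,\alpha^4 e^{-4\alpha}}\Big) .$$

It follows that for the choice of $t = e^{-4\alpha}D^2/(64\,d_1 d_2)$,

$$\mathbb{P}\Big\{Z - \mathbb{E}[Z] \geq \frac{e^{-4\alpha}D^2}{64\,d_1 d_2}\Big\} \leq \exp\Big\{-\frac{e^{-4\alpha}\,k\,D^4}{2^{19}\alpha^4 d_1 d_2^2}\Big\} .$$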
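The exponent in the last display can be cross-checked against the generic bounded differences (McDiarmid) exponent $2t^2/\sum c^2$. The sketch below assumes, as read from the derivation above, $k d_1$ independent variables each with bounded difference $16\alpha^2 e^{-2\alpha}/(k d_1)$; the function names and the sample parameter values are assumptions made for this example.

```python
import math


def mcdiarmid_exponent(alpha, k, d1, d2, D):
    """2 t^2 / sum_i c_i^2 for k*d1 variables with identical bounded difference c."""
    c = 16 * alpha ** 2 * math.exp(-2 * alpha) / (k * d1)
    t = math.exp(-4 * alpha) * D ** 2 / (64 * d1 * d2)
    return 2 * t ** 2 / (k * d1 * c ** 2)


def closed_form_exponent(alpha, k, d1, d2, D):
    """The exponent e^{-4 alpha} k D^4 / (2^19 alpha^4 d1 d2^2) stated in the text."""
    return math.exp(-4 * alpha) * k * D ** 4 / (2 ** 19 * alpha ** 4 * d1 * d2 ** 2)


params = dict(alpha=0.7, k=30, d1=50, d2=80, D=3.0)
print(math.isclose(mcdiarmid_exponent(**params), closed_form_exponent(**params)))
```

The two expressions agree because $64^2 \cdot 16^2 / 2 = 2^{19}$.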


We are left to prove the upper bound on $\mathbb{E}[Z]$ using symmetrization and contraction. Define random variables

$$Y_{i,\ell,\ell',\ell''}(\Delta) \equiv (\Delta_{i,j_{i,\ell}} - \Delta_{i,j_{i,\ell'}})^2\,\chi_{i,\ell,\ell',\ell''} , \tag{42}$$

where the randomness is in the choice of alternatives $j_{i,\ell}$, $j_{i,\ell'}$, and $j_{i,\ell''}$, and the outcome of the comparisons of those three alternatives.

The main challenge in applying the symmetrization to $\sum_{\ell,\ell',\ell'' \in [k]} Y_{i,\ell,\ell',\ell''}(\Delta)$ is that we need to partition the summation over the set $[k] \times [k] \times [k]$ into subsets of independent random variables, such that we can apply the standard symmetrization argument. To this end, we prove in the following lemma a generalization of the well-known problem of scheduling a round robin tournament to a tournament of matches involving three teams each. No team is present in more than one triple in a single round, and we want to minimize the number of rounds needed to cover all combinations of triples. For example, when there are $k = 6$ teams, there is a simple construction of such a tournament: $T_1 = \{(1,2,3),(4,5,6)\}$, $T_2 = \{(1,2,4),(3,5,6)\}$, $T_3 = \{(1,2,5),(3,4,6)\}$, $T_4 = \{(1,2,6),(3,4,5)\}$, $T_5 = \{(1,3,4),(2,5,6)\}$, $T_6 = \{(1,3,5),(2,4,6)\}$, $T_7 = \{(1,3,6),(2,4,5)\}$, $T_8 = \{(1,4,5),(2,3,6)\}$, $T_9 = \{(1,4,6),(2,3,5)\}$, $T_{10} = \{(1,5,6),(2,3,4)\}$. This is a perfect scheduling of a tournament with three teams in each match. For a general $k$, the following lemma provides a construction with $O(k^2)$ rounds.
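The $k = 6$ schedule listed above can be verified mechanically: every round should use each team at most once, and the ten rounds together should contain each of the $\binom{6}{3} = 20$ unordered triples exactly once. The following sketch performs that check; the names `ROUNDS_K6` and `is_perfect_schedule` are assumptions made for this example.

```python
from itertools import combinations

# The ten rounds of the k = 6 example, copied from the text.
ROUNDS_K6 = [
    [(1, 2, 3), (4, 5, 6)], [(1, 2, 4), (3, 5, 6)],
    [(1, 2, 5), (3, 4, 6)], [(1, 2, 6), (3, 4, 5)],
    [(1, 3, 4), (2, 5, 6)], [(1, 3, 5), (2, 4, 6)],
    [(1, 3, 6), (2, 4, 5)], [(1, 4, 5), (2, 3, 6)],
    [(1, 4, 6), (2, 3, 5)], [(1, 5, 6), (2, 3, 4)],
]


def is_perfect_schedule(rounds, k):
    """True if no team repeats within a round and every triple occurs exactly once."""
    seen = set()
    for matches in rounds:
        teams = [team for triple in matches for team in triple]
        if len(teams) != len(set(teams)):
            return False                     # a team plays twice in this round
        for triple in matches:
            key = frozenset(triple)
            if key in seen:
                return False                 # a triple is scheduled twice
            seen.add(key)
    return seen == {frozenset(c) for c in combinations(range(1, k + 1), 3)}


print(is_perfect_schedule(ROUNDS_K6, 6))
```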

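Lemma A.6. There exists a partition $(T_1, \ldots, T_N)$ of $[k] \times [k] \times [k]$ for some $N \leq 24k^2$ such that the $T_a$'s are disjoint subsets of $[k] \times [k] \times [k]$, $\bigcup_{a \in [N]} T_a = [k] \times [k] \times [k]$, $|T_a| \leq \lfloor k/3 \rfloor$, and for any $a \in [N]$ the set of random variables in $T_a$ satisfy

$$\{Y_{i,\ell,\ell',\ell''}\}_{i \in [d_1],\, (\ell,\ell',\ell'') \in T_a} \text{ are mutually independent} .$$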


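Now, we are ready to partition the summation.

$$\begin{aligned}
\mathbb{E}[Z] &= \frac{e^{-2\alpha}}{2k^3 d_1}\,\mathbb{E}\Big[\sup_{\Delta \in \mathcal{B}(D)} \sum_{i \in [d_1]}\sum_{\ell,\ell',\ell'' \in [k]} \big(\mathbb{E}[Y_{i,\ell,\ell',\ell''}(\Delta)] - Y_{i,\ell,\ell',\ell''}(\Delta)\big)\Big] \\
&\leq \frac{e^{-2\alpha}}{2k^3 d_1}\sum_{a \in [N]} \mathbb{E}\Big[\sup_{\Delta \in \mathcal{B}(D)} \sum_{i \in [d_1]}\sum_{(\ell,\ell',\ell'') \in T_a} \big(\mathbb{E}[Y_{i,\ell,\ell',\ell''}(\Delta)] - Y_{i,\ell,\ell',\ell''}(\Delta)\big)\Big] \\
&\leq \frac{e^{-2\alpha}}{k^3 d_1}\sum_{a \in [N]} \mathbb{E}\Big[\sup_{\Delta \in \mathcal{B}(D)} \sum_{i \in [d_1]}\sum_{(\ell,\ell',\ell'') \in T_a} \xi_{i,\ell,\ell',\ell''}\,Y_{i,\ell,\ell',\ell''}(\Delta)\Big] \\
&= \frac{e^{-2\alpha}}{k^3 d_1}\sum_{a \in [N]} \mathbb{E}\Big[\sup_{\Delta \in \mathcal{B}(D)} \sum_{i \in [d_1]}\sum_{(\ell,\ell',\ell'') \in T_a} \xi_{i,\ell,\ell',\ell''}\,(\Delta_{i,j_{i,\ell}} - \Delta_{i,j_{i,\ell'}})^2\,\chi_{i,\ell,\ell',\ell''}\Big] ,
\end{aligned} \tag{43}$$

where the first inequality follows from the fact that the sum of the suprema is no less than the supremum of the sum, and the second inequality follows from the standard symmetrization argument applied to independent random variables $\{Y_{i,\ell,\ell',\ell''}(\Delta)\}_{i \in [d_1],\, (\ell,\ell',\ell'') \in T_a}$ with i.i.d. Rademacher random variables $\xi_{i,\ell,\ell',\ell''}$'s. Since $(\Delta_{i,j_{i,\ell}} - \Delta_{i,j_{i,\ell'}})^2\chi_{i,\ell,\ell',\ell''} \leq 4\alpha\,|\Delta_{i,j_{i,\ell}} - \Delta_{i,j_{i,\ell'}}|\,\chi_{i,\ell,\ell',\ell''}$, we have by the Ledoux-Talagrand contraction inequality that

$$\mathbb{E}\Big[\sup_{\Delta \in \mathcal{B}(D)} \sum_{i \in [d_1]}\sum_{(\ell,\ell',\ell'') \in T_a} \xi_{i,\ell,\ell',\ell''}\,(\Delta_{i,j_{i,\ell}} - \Delta_{i,j_{i,\ell'}})^2\,\chi_{i,\ell,\ell',\ell''}\Big] \leq 8\alpha\,\mathbb{E}\Big[\sup_{\Delta \in \mathcal{B}(D)} \sum_{i \in [d_1]}\sum_{(\ell,\ell',\ell'') \in T_a} \xi_{i,\ell,\ell',\ell''}\,\chi_{i,\ell,\ell',\ell''}\,\langle\langle \Delta, e_i(e_{j_{i,\ell}} - e_{j_{i,\ell'}})^T \rangle\rangle\Big] . \tag{44}$$

Applying Hölder's inequality, we get that

$$\Big|\sum_{i \in [d_1]}\sum_{(\ell,\ell',\ell'') \in T_a} \xi_{i,\ell,\ell',\ell''}\,\chi_{i,\ell,\ell',\ell''}\,\langle\langle \Delta, e_i(e_{j_{i,\ell}} - e_{j_{i,\ell'}})^T \rangle\rangle\Big| \leq |||\Delta|||_{\mathrm{nuc}}\,\Big|\Big|\Big|\sum_{i \in [d_1]}\sum_{(\ell,\ell',\ell'') \in T_a} \xi_{i,\ell,\ell',\ell''}\,\chi_{i,\ell,\ell',\ell''}\,e_i(e_{j_{i,\ell}} - e_{j_{i,\ell'}})^T\Big|\Big|\Big|_2 . \tag{45}$$

We are left to prove that the expected value of the right-hand side of the above inequality is bounded by $C\,|||\Delta|||_{\mathrm{nuc}}\sqrt{k\,d_1\log d/\min\{d_1, d_2\}}$ for some numerical constant $C$. For $i \in [d_1]$ and $(\ell,\ell',\ell'') \in T_a$, let $W_{i,\ell,\ell',\ell''} = \xi_{i,\ell,\ell',\ell''}\,\chi_{i,\ell,\ell',\ell''}\,e_i(e_{j_{i,\ell}} - e_{j_{i,\ell'}})^T$ be independent zero-mean random matrices, such that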


lltr,, 


\\\2 = 


almost surely, and 


^i/AA' Xi/AA' ~ 

< \/2 , 

2 

= H{(^ii^3i,e - i^3i,i - ^3 

i,ii)^'i)XiA^' A' 

= 2E [xi, £/',£"] ejcf 


A 2ejef , 



and 


^ - Cj,,, )ejei(ej^ , - f] 


2 


T _ _11^ 

^d 2 X d 2 ^2 


This gives 


a = max 


< max < 2|ra| , 


E E 

ie[di] {£AA')( 3 .Ta 

2di|r,n 2di|r, 


d2 


< 


ie[di] {tAA')&Ta 
2dik 


2 ) 


min{di,d2} 3min{di,d2} ’ 


since we have designed T^’s such that jEal < A:/3. Applying matrix Bernstein inequality |27j yields 
the tail bound 


i&[dl] (lAA')&Ta 


> t > < (di -I- ^ 2 ) exp ^ 


-t^/2 


0-2 + ^t/3 


Choosing t = max { Y^32/cdi logd/(3min{di, ^ 2 })) (16-v/2/3) logd }, we obtain with probability at 
least 1 — 2d“^, 


E E ^i,eAA' 

ie[di] {iAA’)&Ta 


< max 


2>2kdi\ogd 16\/21ogd 


3min{di, ^ 2 } ’ 
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It follows from the fact 


Sie[rfi] S(£/'/'')era ^ '^i,{e,£',£") Ill^*//'>^"lll 2 — V^diklS that 


E 


E E Wi^e,e',e" 

i£[dl] {l/',l")eTa 


< max 


< 2, 


32A;dilogd 16\/21og(i 1 ^ 2\/2dik 


I y 3 minjiii, ^ 2 } ’ 
32kdi logd 


3d? 


3min{cii,d2} ’ 

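where the last inequality follows from the assumption that $(16\min\{d_1, d_2\}\log d)/(3d_1) \leq k \leq d_1^2\log d$. Substituting this in the RHS of Eq. (45), and then together with Eqs. (44) and (43), this gives the following desired bound:

$$\begin{aligned}
\mathbb{E}[Z] &\leq \frac{16\alpha\,e^{-2\alpha}}{k^3 d_1}\sum_{a \in [N]} \sup_{\Delta \in \mathcal{B}(D)} |||\Delta|||_{\mathrm{nuc}}\,\sqrt{\frac{32\,k\,d_1\log d}{3\min\{d_1, d_2\}}} \\
&\leq \sum_{a \in [N]} \frac{\sqrt{2}\,e^{-4\alpha}D^2}{16\sqrt{3}\,k^2\,d_1 d_2} \\
&\leq \frac{9\,e^{-4\alpha}D^2}{40\,d_1 d_2} ,
\end{aligned}$$

where the second inequality uses $\mu|||\Delta|||_{\mathrm{nuc}} \leq D^2$ together with the definition $\mu = 2^{10}e^{2\alpha}\alpha\,d_2\sqrt{(d_1\log d)/(k\min\{d_1, d_2\})}$, and the last inequality holds because $N \leq 4k^2$.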


A.5 Proof of Lemma A.6

Recall that $Y_{i,\ell,\ell',\ell''}(\Delta) = (\Delta_{i,j_{i,\ell}} - \Delta_{i,j_{i,\ell'}})^2\chi_{i,\ell,\ell',\ell''}$ as defined in (42). From the random utility model (RUM) interpretation of the MNL model presented earlier, it is not difficult to show that $Y_{i,\ell,\ell',\ell''}$ and $Y_{i,\tilde\ell,\tilde\ell',\tilde\ell''}$ are mutually independent if the two triples $(\ell,\ell',\ell'')$ and $(\tilde\ell,\tilde\ell',\tilde\ell'')$ do not overlap, i.e., no index is present in both triples.

Now, borrowing the terminology from round robin tournaments, we construct a schedule for a tournament with $k$ teams where each match involves three teams. Let $T_{a,b}$ denote a set of triples playing in the same round, indexed by two integers $a \in \{3, \ldots, 2k-3\}$ and $b \in \{5, \ldots, 2k-1\}$. Hence, there are in total $N = (2k-5)^2$ rounds.

Each round $(a, b)$ consists of disjoint triples and is defined as

$$T_{a,b} = \{(\ell,\ell',\ell'') \in [k] \times [k] \times [k] \mid \ell < \ell' < \ell'',\ \ell + \ell' = a, \text{ and } \ell' + \ell'' = b\} .$$

We need to prove that (a) there is no missing triple; and (b) no team plays twice in a single round. First, for any ordered triple $(\ell,\ell',\ell'')$, there exist $a \in \{3, \ldots, 2k-3\}$ and $b \in \{5, \ldots, 2k-1\}$ such that $\ell + \ell' = a$ and $\ell' + \ell'' = b$. This proves that all ordered triples are covered by the above construction. Next, given a pair $(a, b)$, no two triples in $T_{a,b}$ can share the same team. Suppose there exist two distinct ordered triples $(\ell,\ell',\ell'')$ and $(\tilde\ell,\tilde\ell',\tilde\ell'')$ both in $T_{a,b}$, and one of the indices is shared. Then, from the two equations $\ell + \ell' = \tilde\ell + \tilde\ell' = a$ and $\ell' + \ell'' = \tilde\ell' + \tilde\ell'' = b$, it follows that all three indices must be the same, which is a contradiction. This proves the desired claim for ordered triples.

One caveat is that we wanted to cover the whole $[k] \times [k] \times [k]$, and not just the ordered triples. In the above construction, for example, a triple $(3,2,1)$ does not appear. This can be resolved by simply taking all $T_{a,b}$'s from the above construction, making 6 copies of each round, and permuting all the triples in each copy according to the same permutation over $\{1,2,3\}$. This increases the total number of rounds to $N = 6(2k-5)^2 \leq 24k^2$. Note that $|T_{a,b}| \leq \lfloor k/3 \rfloor$ since no item can be in more than one triple.
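The construction of the rounds $T_{a,b}$ and its six permuted copies can be sketched directly from the definition above. The function names `build_rounds` and `permuted_rounds` are assumptions made for this example; the code only enumerates the sets and does not address the independence claim.

```python
from itertools import permutations


def build_rounds(k):
    """Rounds T_{a,b}: increasing triples (l, l', l'') with l + l' = a and l' + l'' = b."""
    rounds = {}
    for a in range(3, 2 * k - 2):            # a in {3, ..., 2k-3}
        for b in range(5, 2 * k):            # b in {5, ..., 2k-1}
            triples = []
            for mid in range(1, k + 1):      # mid plays the role of l'
                lo, hi = a - mid, b - mid
                if 1 <= lo < mid < hi <= k:
                    triples.append((lo, mid, hi))
            rounds[(a, b)] = triples
    return rounds


def permuted_rounds(k):
    """Six copies of every round, each copy reordering all its triples by one permutation."""
    result = []
    for triples in build_rounds(k).values():
        for perm in permutations(range(3)):
            result.append([tuple(t[p] for p in perm) for t in triples])
    return result
```

For a given $k$, `build_rounds(k)` has $(2k-5)^2$ entries (some of them empty), and `permuted_rounds(k)` has $6(2k-5)^2$ rounds covering every triple of three distinct indices.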


B Proof of estimating approximate low-rank matrices in Corollary 

n 


We follow closely the proof of a similar corollary in [24]. First fix a threshold $\tau > 0$, and set $r = \max\{j \mid \sigma_j(\Theta^*) > \tau\}$. With this choice of $r$, we have


E 

j=r-\-l 


mm{di,d2} 




aj{e* 


< T 


j=r+l 


min{di,d2} 

E 

j=r+l 




< T 


1-g 


Pq 


Also, since rr*? < < Pq, h follows that y/r < y/f^r . Using these bounds, Eq. Q 

is now 


0-0 


< 288 \/ 2 coe^°did 2 Aq 


0-0 




=A 


With the choice of r = A, it follows after some algebra that 


0-0 




C Proof of the information-theoretic lower bound in Theorem 2


The proof uses information-theoretic methods which reduce the estimation problem to a multiway hypothesis testing problem. To prove a lower bound on the expected error, it suffices to prove

$$\sup_{\Theta^* \in \Omega_\alpha} \mathbb{P}\Big\{|||\hat{\Theta} - \Theta^*|||_{\mathrm{F}}^2 \geq \frac{\delta^2}{4}\Big\} \geq \frac{1}{2} . \tag{46}$$


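To prove the above claim, we follow the standard recipe of constructing a packing in $\Omega_\alpha$. Consider a family $\{\Theta^{(1)}, \ldots, \Theta^{(M(\delta))}\}$ of $d_1 \times d_2$ dimensional matrices contained in $\Omega_\alpha$, satisfying $|||\Theta^{(\ell_1)} - \Theta^{(\ell_2)}|||_{\mathrm{F}} \geq \delta$ for all $\ell_1 \neq \ell_2 \in [M(\delta)]$. We will use $M$ to refer to $M(\delta)$ to simplify the notation. Suppose we draw an index $L \in [M(\delta)]$ uniformly at random, and we are given direct observations $\sigma_i$ as per the MNL model with $\Theta^* = \Theta^{(L)}$ on a randomly chosen set of $k$ items $S_i$ for each user $i \in [d_1]$. It follows from the triangle inequality that

$$\sup_{\Theta^* \in \Omega_\alpha} \mathbb{P}\Big\{|||\hat{\Theta} - \Theta^*|||_{\mathrm{F}}^2 \geq \frac{\delta^2}{4}\Big\} \geq \mathbb{P}\{\hat{L} \neq L\} , \tag{47}$$

where $\hat{L}$ is the resulting best estimate of the multiway hypothesis testing on $L$. The generalized Fano's inequality gives

$$\begin{aligned}
\mathbb{P}\{\hat{L} \neq L \mid S(1), \ldots, S(d_1)\} &\geq 1 - \frac{I(\hat{L}; L) + \log 2}{\log M} && (48) \\
&\geq 1 - \frac{\binom{M}{2}^{-1}\sum_{\ell_1,\ell_2 \in [M]} D_{\mathrm{KL}}(\Theta^{(\ell_1)} \| \Theta^{(\ell_2)}) + \log 2}{\log M} , && (49)
\end{aligned}$$

where $D_{\mathrm{KL}}(\Theta^{(\ell_1)} \| \Theta^{(\ell_2)})$ denotes the Kullback-Leibler divergence between the distributions of the partial rankings $\mathbb{P}\{\sigma_1, \ldots, \sigma_{d_1} \mid \Theta^{(\ell_1)}, S(1), \ldots, S(d_1)\}$ and $\mathbb{P}\{\sigma_1, \ldots, \sigma_{d_1} \mid \Theta^{(\ell_2)}, S(1), \ldots, S(d_1)\}$. The second inequality follows from a standard technique, which we repeat here for completeness. Let $\Sigma = \{\sigma_1, \ldots, \sigma_{d_1}\}$ denote the observed outcome of comparisons. Since $L$-$\Sigma$-$\hat{L}$ form a Markov chain, the data processing inequality gives $I(\hat{L}; L) \leq I(\Sigma; L)$. For simplicity, we drop the conditioning on the set of alternatives $\{S(1), \ldots, S(d_1)\}$, and let $p(\cdot)$ denote the joint, marginal, and conditional distributions of the respective random variables. It follows that

$$\begin{aligned}
I(\Sigma; L) &= \sum_{\ell \in [M]}\sum_{\Sigma} p(\Sigma, \ell)\log\frac{p(\Sigma, \ell)}{p(\ell)\,p(\Sigma)} \\
&= \frac{1}{M}\sum_{\ell \in [M]}\sum_{\Sigma} p(\Sigma \mid \ell)\log\frac{p(\Sigma \mid \ell)}{\frac{1}{M}\sum_{\ell' \in [M]} p(\Sigma \mid \ell')} \\
&\leq \frac{1}{M^2}\sum_{\ell,\ell' \in [M]}\sum_{\Sigma} p(\Sigma \mid \ell)\log\frac{p(\Sigma \mid \ell)}{p(\Sigma \mid \ell')} \\
&= \frac{1}{M^2}\sum_{\ell,\ell' \in [M]} D_{\mathrm{KL}}(\Theta^{(\ell)} \| \Theta^{(\ell')}) ,
\end{aligned} \tag{50}$$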


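where the first inequality follows from Jensen's inequality. To compute the KL-divergence, recall that from the RUM interpretation of the MNL model (discussed earlier), one can generate sample rankings $\Sigma$ by drawing random variables with exponential distributions with mean $e^{-\Theta^*_{ij}}$'s. Precisely, let $X^{(\ell)} = [X^{(\ell)}_{ij}]_{i \in [d_1], j \in S_i}$ denote the set of random variables, where $X^{(\ell)}_{ij}$ is drawn from the exponential distribution with mean $e^{-\Theta^{(\ell)}_{ij}}$. The MNL ranking follows by ordering the alternatives in each $S_i$ according to this $\{X^{(\ell)}_{ij}\}_{j \in S_i}$ by ranking the smaller ones on the top. This forms a Markov chain $L$-$X^{(L)}$-$\Sigma$, and the standard data processing inequality gives

$$\begin{aligned}
D_{\mathrm{KL}}(\Theta^{(\ell_1)} \| \Theta^{(\ell_2)}) &\leq D_{\mathrm{KL}}(X^{(\ell_1)} \| X^{(\ell_2)}) && (51) \\
&= \sum_{i \in [d_1]}\sum_{j \in S_i} \Big\{ e^{\Theta^{(\ell_2)}_{ij} - \Theta^{(\ell_1)}_{ij}} - \big(\Theta^{(\ell_2)}_{ij} - \Theta^{(\ell_1)}_{ij}\big) - 1 \Big\} && (52) \\
&\leq \frac{e^{2\alpha}}{4\alpha^2}\sum_{i \in [d_1]}\sum_{j \in S_i} \big(\Theta^{(\ell_1)}_{ij} - \Theta^{(\ell_2)}_{ij}\big)^2 , && (53)
\end{aligned}$$

where the last inequality follows from the fact that $e^{x} - x - 1 \leq (e^{2\alpha}/(4\alpha^2))\,x^2$ for any $x \in [-2\alpha, 2\alpha]$. Taking expectation over the randomly chosen set of alternatives,

$$\mathbb{E}_{S(1),\ldots,S(d_1)}\big[D_{\mathrm{KL}}(\Theta^{(\ell_1)} \| \Theta^{(\ell_2)})\big] \leq \frac{e^{2\alpha}\,k}{4\alpha^2 d_2}\,|||\Theta^{(\ell_1)} - \Theta^{(\ell_2)}|||_{\mathrm{F}}^2 . \tag{54}$$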
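The exponential-variable construction used in this argument can be simulated to see that ranking by the smallest draw reproduces the MNL choice probabilities. The sketch below is illustrative only: the function names, the seed, and the toy score vector are assumptions made for this example, and it checks only the top choice rather than the full ranking distribution.

```python
import math
import random


def sample_ranking(theta, rng):
    """Rank items by independent exponential draws with mean exp(-theta_j), smallest first."""
    # random.expovariate takes the rate, which is exp(theta_j) for mean exp(-theta_j).
    clocks = [(rng.expovariate(math.exp(t)), j) for j, t in enumerate(theta)]
    return [j for _, j in sorted(clocks)]


def top_choice_frequencies(theta, n_samples, seed=0):
    """Empirical frequency with which each item is ranked first."""
    rng = random.Random(seed)
    counts = [0] * len(theta)
    for _ in range(n_samples):
        counts[sample_ranking(theta, rng)[0]] += 1
    return [c / n_samples for c in counts]


def mnl_probabilities(theta):
    """MNL (softmax) probability of each item being the top choice."""
    total = sum(math.exp(t) for t in theta)
    return [math.exp(t) / total for t in theta]
```

Comparing `top_choice_frequencies(theta, n)` with `mnl_probabilities(theta)` for a moderately large `n` should show agreement up to sampling noise.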
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Combined with (49), we get that 


^ J S>.,fa,[«|(<=^°fc/(4a^ri2))|||e«.) - efe'inl + log 2 

“ logM 

The remainder of the proof relies on the following probabilistic packing. 


(55) 

(56) 


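Lemma C.1. Let $d_2 \geq d_1 \geq 607$ be positive integers. Then for each $r \in \{1, \ldots, d_1\}$, and for any positive $\delta > 0$, there exists a family of $d_1 \times d_2$ dimensional matrices $\{\Theta^{(1)}, \ldots, \Theta^{(M(\delta))}\}$ with cardinality $M(\delta) = \lfloor (1/4)\exp(r d_2/576) \rfloor$ such that each matrix is rank $r$ and the following bounds hold:

$$\begin{aligned}
|||\Theta^{(\ell)}|||_{\mathrm{F}} &\leq \delta , && \text{for all } \ell \in [M] && (57) \\
|||\Theta^{(\ell_1)} - \Theta^{(\ell_2)}|||_{\mathrm{F}} &\geq \delta , && \text{for all } \ell_1 \neq \ell_2 \in [M] && (58) \\
\Theta^{(\ell)} &\in \Omega_{\tilde\alpha} , && \text{for all } \ell \in [M] , && (59)
\end{aligned}$$

with $\tilde\alpha = (8\delta/d_2)\sqrt{2\log d}$ for $d = (d_1 + d_2)/2$.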

Suppose 6 < ad 2 /{ 8\/2 log d) such that the matrices in the packing set are entry-wise bounded 
by a, then the above lemma implies that — 0 (^ 2 )|||^ < 4 , 52 ^ which gives 


P 



> 


1 - 


ci^d2 

rd _ 

576 


+ log 2 
2 log 2 


> 


1 

2 ’ 


where the last inequality holds for <5^ < {a‘^d2/{e‘^°^k)){{rd/1152) — 2 log 2). If we assume rd > 3195 
for simplicity, this bound on 6 can be simplified to d < ae~°^^r d 2 d/(2304 A:). Together with (46) 
and (47), this proves that for all 5 < min{ad 2 /( 8\/2 log d),ae~°'^r d 2 d/ (2304/c)}. 


inf sup E 
0 B*&na 




4 ■ 


Choosing 5 appropriately to maximize the right-hand side finishes the proof of the desired claim. 


C.1 Proof of Lemma C.1

Following the construction in [24], we use the probabilistic method to prove the existence of the desired family. We will show that the following procedure succeeds in producing the desired family with probability at least half, which proves its existence. Let $d = (d_1 + d_2)/2$, and suppose $d_2 \geq d_1$ without loss of generality. For the choice of $M' = e^{r d_2/576}$, and for each $\ell \in [M']$, generate a rank-$r$ matrix $\Theta^{(\ell)} \in \mathbb{R}^{d_1 \times d_2}$ as follows:

$$\Theta^{(\ell)} = \frac{\delta}{\sqrt{r d_2}}\,U\,(V^{(\ell)})^T\Big(I_{d_2 \times d_2} - \frac{1}{d_2}\mathbb{1}\mathbb{1}^T\Big) , \tag{60}$$

where $U \in \mathbb{R}^{d_1 \times r}$ is a random orthogonal basis such that $U^T U = I_{r \times r}$ and $V^{(\ell)} \in \mathbb{R}^{d_2 \times r}$ is a random matrix with each entry $V^{(\ell)}_{ij} \in \{-1, +1\}$ chosen independently and uniformly at random.

By construction, notice that $|||\Theta^{(\ell)}|||_{\mathrm{F}} = (\delta/\sqrt{r d_2})\,|||U(V^{(\ell)})^T(I - (1/d_2)\mathbb{1}\mathbb{1}^T)|||_{\mathrm{F}} \leq \delta$, since $|||U(V^{(\ell)})^T|||_{\mathrm{F}} = \sqrt{r d_2}$ and $(I - (1/d_2)\mathbb{1}\mathbb{1}^T)$ is a projection which can only decrease the norm.
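To make the construction in (60) concrete, the following sketch generates one such matrix and lets the two properties just mentioned be checked: zero row sums and Frobenius norm at most $\delta$. For simplicity it fixes $U$ to the first $r$ standard basis vectors, which is an assumption of this example (the proof uses a random orthonormal $U$); the function names are likewise hypothetical.

```python
import math
import random


def packing_matrix(d1, d2, r, delta, rng):
    """One matrix delta/sqrt(r d2) * U V^T (I - 11^T/d2), with U = [e_1, ..., e_r] (assumed)."""
    if r > d1:
        raise ValueError("r must not exceed d1")
    # V has i.i.d. Rademacher entries, shape d2 x r.
    V = [[rng.choice((-1, 1)) for _ in range(r)] for _ in range(d2)]
    scale = delta / math.sqrt(r * d2)
    theta = []
    for i in range(d1):
        # Row i of U V^T is column i of V when i < r, and zero otherwise.
        row = [float(V[j][i]) if i < r else 0.0 for j in range(d2)]
        mean = sum(row) / d2
        # Right-multiplying by (I - 11^T/d2) subtracts the row mean.
        theta.append([scale * (x - mean) for x in row])
    return theta


def frobenius_norm(matrix):
    return math.sqrt(sum(x * x for row in matrix for x in row))
```

Centering each row is what places the matrix in the zero-row-sum set, and it can only reduce the Frobenius norm from its uncentered value $\delta$.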

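Now, consider $|||\Theta^{(\ell_1)} - \Theta^{(\ell_2)}|||_{\mathrm{F}}^2 = (\delta^2/(r d_2))\,|||(I - (1/d_2)\mathbb{1}\mathbb{1}^T)(V^{(\ell_1)} - V^{(\ell_2)})|||_{\mathrm{F}}^2 \equiv f(V^{(\ell_1)}, V^{(\ell_2)})$, which is a function over $2 r d_2$ i.i.d. random Rademacher variables $V^{(\ell_1)}$ and $V^{(\ell_2)}$ which define $\Theta^{(\ell_1)}$ and $\Theta^{(\ell_2)}$ respectively. Since $f$ is Lipschitz in the following sense, we can apply McDiarmid's concentration inequality. For all $(V^{(\ell_1)}, V^{(\ell_2)})$ and $(\tilde{V}^{(\ell_1)}, V^{(\ell_2)})$ that differ in only one variable, say $\tilde{V}^{(\ell_1)} = V^{(\ell_1)} + 2e_{ij}$, for some standard basis matrix $e_{ij}$, we have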


f{y(ti)^V(^2)) _ f(y(h)^V{t2))\ ^ 

rd 2 

^ . 2 

rd 2 


(I - — 

d 2 


rd 2 


(I - ^ 2eij) 

«2 


2(1 - 

«2 


p T (12 U 2 


<4^ _ 6 _ 

~ rd 2 rd 2 


(I - —- 1 /(^ 2 )) 

d 2 


|2e. 




< 


12 6 ^ 
rd 2 


(61) 

(62) 

(63) 

(64) 


where we used the fact that (I — — 1 /(^ 2 )^ jg entry-wise bounded by four. The expec¬ 

tation E[/(R(^i), R(^2))] ig 


rd 2 


E 


(I - —- T/(^2)) 

d 2 


-E 

) 

-E 


252 
rd 2 
252 
rd 2 
2 <52 (d2-l) 


(I - 

«2 


R(b) 

2' 

-^E 



F 

rdl 

- 


d 2 

Applying McDiarmid’s inequality with bounded difference 126'^/{rd 2 ), we get that 

]p{/(y(R),yff2))< 2^2(1 _ 1 /^ 2 ) I < 


(65) 

( 66 ) 

(67) 

( 68 ) 


Since there are less than (M')2 pairs of (^ 1 , 72 ), setting t = (1 — 2 /d2)S‘^ and applying the union 
bound gives 


mm 

ii,e 2 e[M'] 


0(^i) _ @(^ 2 ) 


> 


> 1 — exp I — 


r d 2 
144 


2 \2 


'-s) 


+ 2 


logM'} > g , (69) 


where we used M’ = exp{rd2/576} and d 2 > 607. 

We are left to prove that 0(^)’s are in H( 35 /^ 2 )\/ 2 Tog^ defined in Q. Since we removed 
the mean such that 0(^H = 0 by construction, we only need to show that the maximum entry is 
bounded by {86/d2)\/2 log d 2 - We first prove an upper bound in 0 for a fixed i G [M'], and use 
this to show that there exists a large enough subset of matrices satisfying this bound. From (125), 


consider {UV'^)ij = {{ui, Vj)), where Ui G M'" is the first r entries of a random vector drawn uniformly 
from the (i 2 -dimensional sphere, and Vj G W' is drawn uniformly at random from { —1,+1}^ with 
ll^ill ~ Using Levy’s theorem for concentration on the sphere [29], we have 


'iui,Vj))\ >t} < 2 exp I - . 


(70) 


Notice that by the definition (125), maxjj |0j-j | < {2d/\/rd2) maxjj \{{ui,Vj))\. Settingt = ^/{32rJd2yk>g^ 


and taking the union bound over all did 2 indices, we get 


I maxI < — I > 1 - 2 did 2 exp | - 4logd 2 | 


> 


2 ’ 


(71) 


for a fixed i G [M']. Consider the event that there exists a subset S C [M'] of cardinality M = 
(1/4)M' with the same bound on maximum entry, then from (0 we get 


35 C [M'] such that 


0M 


d 2 


M' 


< aiu . 5 } > f: ("') (I)", (72) 

^ m=M ^ ^ 


which is larger than half for our choice of M < M'/2. 


D Proof of Theorem 3


We use similar notations and techniques as the proof of Theorem 1 in Appendix A. From the definition of $\mathcal{L}(\Theta)$, the gradient evaluated at the true parameter $\Theta^*$ is

$$\nabla\mathcal{L}(\Theta^*) = -\frac{1}{n}\sum_{i=1}^{n} \big(e_{u_i}e_{v_i}^T - p_i\big) , \tag{73}$$
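where $p_i$ denotes the conditional probability of the MNL choice for the $i$-th sample. Precisely, $p_i = \sum_{j_1 \in S_i}\sum_{j_2 \in T_i} p_{j_1,j_2|S_i,T_i}\,e_{j_1}e_{j_2}^T$, where $p_{j_1,j_2|S_i,T_i}$ is the probability that the pair of items $(j_1, j_2)$ is chosen at the $i$-th sample such that $p_{j_1,j_2|S_i,T_i} \equiv \mathbb{P}\{(u_i, v_i) = (j_1, j_2) \mid S_i, T_i\} = e^{\Theta^*_{j_1,j_2}}/\big(\sum_{j_1' \in S_i, j_2' \in T_i} e^{\Theta^*_{j_1',j_2'}}\big)$, where $(u_i, v_i)$ is the pair of items selected by the $i$-th user among the set of pairs of alternatives $S_i \times T_i$. The Hessian can be computed as

$$\begin{aligned}
\frac{\partial^2\mathcal{L}(\Theta)}{\partial\Theta_{j_1,j_2}\,\partial\Theta_{j_1',j_2'}} &= \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}\big((j_1,j_2) \in S_i \times T_i\big)\,\frac{\partial p_{j_1,j_2|S_i,T_i}}{\partial\Theta_{j_1',j_2'}} && (74) \\
&= \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}\big((j_1,j_2),(j_1',j_2') \in S_i \times T_i\big)\,\Big(p_{j_1,j_2|S_i,T_i}\,\mathbb{I}\big((j_1,j_2) = (j_1',j_2')\big) - p_{j_1,j_2|S_i,T_i}\,p_{j_1',j_2'|S_i,T_i}\Big) . && (75)
\end{aligned}$$

We use $\nabla^2\mathcal{L}(\Theta) \in \mathbb{R}^{d_1 d_2 \times d_1 d_2}$ to denote this Hessian. Let $\Delta = \Theta^* - \hat{\Theta}$ where $\hat{\Theta}$ is an optimal solution to the convex optimization in (15). We introduce the following key technical lemmas.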


Lemma A.l Eq. (26) 


The following lemma provides a bound on the gradient using the concentration of measure for sums of independent random matrices [27].

Lemma D.1. For any positive constant $c \geq 1$ and $n \geq (4(1+c)e^{2\alpha}d_1 d_2\log d)/\max\{d_1, d_2\}$, with probability at least $1 - 2d^{-c}$,

$$|||\nabla\mathcal{L}(\Theta^*)|||_2 \leq \sqrt{\frac{4(1+c)\,e^{2\alpha}\max\{d_1, d_2\}\log d}{d_1 d_2\,n}} . \tag{76}$$


Since we are typically interested in the regime where the number of samples is much smaller than the dimension $d_1 \times d_2$ of the problem, the Hessian is typically not positive definite. However, when we restrict our attention to the vectorized $\Delta$ with relatively small nuclear norm, we can prove restricted strong convexity, which gives the following bound.

Lemma D.2 (Restricted Strong Convexity for bundled choice modeling). Fix any 0 G 

and assume (min{(ii, (i 2 }/min{A:i, ^ 2 }) logd < n < min{d^ log d, ^ 1^2 max{(i^, (i|} log d}. Under 
the random sampling model of the alternatives {jm}je[n],ae[fci] frowi the first set of items [di], 
{jib}i^[n],b&[ki] frowi the sccond set of items [d 2 ] and the random outcome of the comparisons de¬ 
scribed in section with probability larger than 1 — 

p-2a 

Vec(A)^v2£(0)Vec(A) > ^^|||A|||2 , (77) 


for all A in M where 

A' = {A€ I IIIAIII^ < 2a , A,,,, = 0 and |||A|||2 > fi\\\A\\\^^^] . 

iis[rfi],i2e[(i2] 

with 


fi = 2^°adid2 


logd 


n min{di,d 2 } min{A:i,/c 2 } 


(78) 


(79) 


Building on these lemmas, the proof of Theorem 3 is divided into the following two cases. In both cases, we will show that

$$|||\Delta|||_{\mathrm{F}}^2 \leq 12\,e^{2\alpha}\,c_1\lambda_1\,d_1 d_2\,|||\Delta|||_{\mathrm{nuc}} , \tag{80}$$

with high probability. Applying Lemma A.1 proves the desired theorem. We are left to show that Eq. (80) holds.


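Case 1: Suppose $|||\Delta|||_{\mathrm{F}}^2 \geq \tilde\mu\,|||\Delta|||_{\mathrm{nuc}}$. With $\Delta = \Theta^* - \hat{\Theta}$, the Taylor expansion yields

$$\mathcal{L}(\hat{\Theta}) = \mathcal{L}(\Theta^*) - \langle\langle \nabla\mathcal{L}(\Theta^*), \Delta \rangle\rangle + \frac{1}{2}\,\mathrm{Vec}(\Delta)^T\,\nabla^2\mathcal{L}(\Theta)\,\mathrm{Vec}(\Delta) , \tag{81}$$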


where 0 = a0 + (1 — a)Q* for some a G [0,1]. It follows from Lemma D.2 that with probability at 
least 1 — 2d“^^'^, 


£(0)-£(0*) > 


-2a 

-ll|V£(e-)||yi|A|||„„ + — 


IIIAIII 


2 

F • 
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From the definition of 0 as an optimal solution of the minimization, we have 


C{e)-CiQ*) < A |||0 


0 


< A|||A|||, 


By the assumption, we choose A > 8Ai. In view of Lemma D.l, this implies that A > 2||| V£(0* 
with probability at least 1 — 2d~^. It follows that with probability at least 1 — 2d~^ — 

--2“ 3A, 


Sd\d 2 


< (A + |||V£(0*)|||2)|||A|| 


< — 
nuc — 2 


By our assumption on A < ciAi, this proves the desired bound in Eq. (80) 

Case 2: Suppose ||| A|||p < /I ||| A|||j^^^. By the definition of /r and the fact that ci > 128/ 
it follows that /2 < 12e^“ciAi did 2 , and we get the same bound as in Eq. (80). 

D.1 Proof of Lemma D.1

Define $X_i = -(e_{u_i}e_{v_i}^T - p_i)$ such that $\nabla\mathcal{L}(\Theta^*) = (1/n)\sum_{i=1}^{n} X_i$, which is a sum of $n$ independent random matrices. Note that since $p_i$ is entry-wise bounded by $e^{2\alpha}/(k_1 k_2)$,

„ 2 a 


IIA 


^\\\2 — 


< 1 + 




and 


= Y.{neuxl]-p^pJ) 


2=1 


2=1 

n 

< ^E[e„,eXj 




2=1 


e^" n. 


di 


Idl xdi 


(82) 

(83) 

(84) 


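where the last inequality follows from the fact that for any given $S_i$, $u_i$ will be chosen with probability at most $e^{2\alpha}/k_1$ if it is in the set $S_i$, which happens with probability $k_1/d_1$. Therefore,

$$\Big|\Big|\Big| \sum_{i=1}^n \mathbb{E}[X_i X_i^T] \Big|\Big|\Big|_2 \;\le\; \frac{e^{2\alpha}\, n}{d_1} . \tag{85}$$

Similarly,

$$\Big|\Big|\Big| \sum_{i=1}^n \mathbb{E}[X_i^T X_i] \Big|\Big|\Big|_2 \;\le\; \frac{e^{2\alpha}\, n}{d_2} . \tag{86}$$

Applying the matrix Bernstein inequality [27], we get

$$\mathbb{P}\big\{ |||\nabla\mathcal{L}(\Theta^*)|||_2 > t \big\} \;\le\; (d_1+d_2)\exp\Big\{ \frac{-n^2 t^2/2}{\big(e^{2\alpha} n \max\{d_1,d_2\}/(d_1 d_2)\big) + \big(1 + e^{2\alpha}/\sqrt{k_1 k_2}\big)\, n t/3} \Big\} , \tag{87}$$

which gives the desired tail probability of $2d^{-c}$ for the choice of

$$t = \max\Big\{ \sqrt{\frac{4(1+c)\,e^{2\alpha}\max\{d_1,d_2\}\log d}{d_1 d_2\, n}} \,,\; \frac{4(1+c)\big(1 + e^{2\alpha}/\sqrt{k_1 k_2}\big)\log d}{3n} \Big\} = \sqrt{\frac{4(1+c)\,e^{2\alpha}\max\{d_1,d_2\}\log d}{d_1 d_2\, n}} ,$$

where the last equality follows from the assumption that $n \ge (4(1+c)\,e^{2\alpha} d_1 d_2 \log d)/\max\{d_1,d_2\}$.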
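As a numerical sanity check on this step, the following minimal sketch evaluates the right-hand side of the Bernstein tail bound at the stated choice of $t$ and compares it with the target $2d^{-c}$, where $d = (d_1+d_2)/2$. The function names and the problem sizes are hypothetical and chosen only for illustration; the variance proxy $e^{2\alpha} n \max\{d_1,d_2\}/(d_1 d_2)$ and the almost-sure bound $1 + e^{2\alpha}/\sqrt{k_1 k_2}$ are taken from the displays above.

```python
import math


def norm_bound(k1, k2, alpha):
    """Almost-sure bound on the spectral norm of each summand X_i."""
    return 1 + math.exp(2 * alpha) / math.sqrt(k1 * k2)


def bernstein_tail(t, n, d1, d2, k1, k2, alpha):
    """Right-hand side of the matrix Bernstein tail bound at threshold t."""
    variance = math.exp(2 * alpha) * n * max(d1, d2) / (d1 * d2)
    linear = norm_bound(k1, k2, alpha) * n * t / 3
    return (d1 + d2) * math.exp(-((n * t) ** 2) / 2 / (variance + linear))


def thresholds(c, n, d1, d2, k1, k2, alpha):
    """The two candidates inside the max that defines t."""
    log_d = math.log((d1 + d2) / 2)
    first = math.sqrt(
        4 * (1 + c) * math.exp(2 * alpha) * max(d1, d2) * log_d / (d1 * d2 * n)
    )
    second = 4 * (1 + c) * norm_bound(k1, k2, alpha) * log_d / (3 * n)
    return first, second


def minimum_samples(c, d1, d2, alpha):
    """Smallest integer n allowed by the sample-size assumption."""
    log_d = math.log((d1 + d2) / 2)
    return math.ceil(
        4 * (1 + c) * math.exp(2 * alpha) * d1 * d2 * log_d / max(d1, d2)
    )


# Hypothetical problem sizes, used only to exercise the formulas.
d1, d2, k1, k2, alpha, c = 50, 80, 3, 4, 0.5, 3
n = minimum_samples(c, d1, d2, alpha)
first, second = thresholds(c, n, d1, d2, k1, k2, alpha)
t = max(first, second)
tail = bernstein_tail(t, n, d1, d2, k1, k2, alpha)
target = 2 * ((d1 + d2) / 2) ** (-c)
print(f"n={n}, t={t:.4g}, tail={tail:.3g}, target={target:.3g}")
```

For these sizes the square-root term is the larger of the two candidates, which is the equality claimed for $t$, and the evaluated tail falls below $2d^{-c}$. The check is only illustrative: it exercises one parameter setting and does not replace the argument.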

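D.2 Proof of Lemma D.2

The quadratic form of the Hessian defined in (75) can be lower bounded by

$$\mathrm{Vec}(\Delta)^T\,\nabla^2\mathcal{L}(\Theta)\,\mathrm{Vec}(\Delta) \;\ge\; \underbrace{\frac{e^{-2\alpha}}{2 k_1^2 k_2^2\, n} \sum_{i=1}^n \sum_{j_1,j_1'\in S_i} \sum_{j_2,j_2'\in T_i} \big( \Delta_{j_1,j_2} - \Delta_{j_1',j_2'} \big)^2}_{\equiv\, H(\Delta)} , \tag{88}$$

which follows from Remark A.4. To lower bound $H(\Delta)$, we first compute the mean:

\begin{align}
\mathbb{E}[H(\Delta)] &= \frac{e^{-2\alpha}}{2 k_1^2 k_2^2\, n} \sum_{i=1}^n \mathbb{E}\Big[ \sum_{j_1,j_1'\in S_i} \sum_{j_2,j_2'\in T_i} \big( \Delta_{j_1,j_2} - \Delta_{j_1',j_2'} \big)^2 \Big] \tag{89}\\
&= \frac{e^{-2\alpha}}{d_1 d_2}\,|||\Delta|||_{\mathrm{F}}^2 , \tag{90}
\end{align}

where we used the fact that $\mathbb{E}\big[\sum_{j_1\in S_i,\, j_2\in T_i} \Delta_{j_1,j_2}\big] = (k_1 k_2/(d_1 d_2)) \sum_{j_1\in[d_1],\, j_2\in[d_2]} \Delta_{j_1,j_2} = 0$ for any $\Delta$ satisfying the zero-sum constraint $\sum_{j_1,j_2}\Delta_{j_1,j_2} = 0$.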

We now prove that $H(\Delta)$ does not deviate from its mean too much. Suppose there exists a $\Delta \in \mathcal{A}'$ defined in (78) such that Eq. (77) is violated, i.e. $H(\Delta) < (e^{-2\alpha}/(8 d_1 d_2))\,|||\Delta|||_{\mathrm{F}}^2$. In this case,

$$\mathbb{E}[H(\Delta)] - H(\Delta) \;>\; \frac{7\,e^{-2\alpha}}{8\,d_1 d_2}\,|||\Delta|||_{\mathrm{F}}^2 . \tag{91}$$


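We will show that this happens with a small probability. We use the same peeling argument as in Appendix with

$$\mathcal{S}_\ell = \Big\{ \Delta \in \mathbb{R}^{d_1\times d_2} \;\Big|\; |||\Delta|||_\infty \le 2\alpha ,\; \beta^{\ell-1}\tilde\mu \le |||\Delta|||_{\mathrm{F}} \le \beta^{\ell}\tilde\mu ,\; \sum_{j_1\in[d_1],\,j_2\in[d_2]} \Delta_{j_1,j_2} = 0 ,\; \text{and } |||\Delta|||_{\mathrm{nuc}} \le \beta^{2\ell}\tilde\mu \Big\} ,$$

where $\beta = \sqrt{10/9}$ and $\ell \in \{1, 2, 3, \ldots\}$, and $\tilde\mu$ is defined in (79). By the peeling argument, there exists an $\ell \in \mathbb{Z}_+$ such that $\Delta \in \mathcal{S}_\ell$ and

$$\mathbb{E}[H(\Delta)] - H(\Delta) \;>\; \frac{7\,e^{-2\alpha}}{8\,d_1 d_2}\,\beta^{2(\ell-1)}\tilde\mu^2 \;\ge\; \frac{7\,e^{-2\alpha}}{9\,d_1 d_2}\,(\beta^{\ell}\tilde\mu)^2 . \tag{92}$$

Applying the union bound over $\ell \in \mathbb{Z}_+$,

\begin{align}
\mathbb{P}\Big\{ \exists \Delta \in \mathcal{A}' ,\; H(\Delta) < \frac{e^{-2\alpha}}{8\,d_1 d_2}\,|||\Delta|||_{\mathrm{F}}^2 \Big\}
&\le \sum_{\ell=1}^{\infty} \mathbb{P}\Big\{ \sup_{\Delta\in\mathcal{S}_\ell} \big( \mathbb{E}[H(\Delta)] - H(\Delta) \big) > \frac{7\,e^{-2\alpha}}{9\,d_1 d_2}\,(\beta^{\ell}\tilde\mu)^2 \Big\} \nonumber\\
&\le \sum_{\ell=1}^{\infty} \mathbb{P}\Big\{ \sup_{\Delta\in\mathcal{B}'(\beta^{\ell}\tilde\mu)} \big( \mathbb{E}[H(\Delta)] - H(\Delta) \big) > \frac{7\,e^{-2\alpha}}{9\,d_1 d_2}\,(\beta^{\ell}\tilde\mu)^2 \Big\} , \tag{93}
\end{align}

where we define the set $\mathcal{B}'(D)$ such that $\mathcal{S}_\ell \subseteq \mathcal{B}'(\beta^{\ell}\tilde\mu)$:

$$\mathcal{B}'(D) = \Big\{ \Delta \in \mathbb{R}^{d_1\times d_2} \;\Big|\; |||\Delta|||_\infty \le 2\alpha ,\; |||\Delta|||_{\mathrm{F}} \le D ,\; \sum_{j_1\in[d_1],\,j_2\in[d_2]} \Delta_{j_1,j_2} = 0 ,\; \tilde\mu\,|||\Delta|||_{\mathrm{nuc}} \le D^2 \Big\} . \tag{94}$$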

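The following key lemma provides the upper bound on this probability.

Lemma D.3. For $(\min\{d_1,d_2\}/\min\{k_1,k_2\})\log d \le n \le d^5\log d$,

$$\mathbb{P}\Big\{ \sup_{\Delta\in\mathcal{B}'(D)} \mathbb{E}[H(\Delta)] - H(\Delta) \ge \frac{e^{-2\alpha} D^2}{2\,d_1 d_2} \Big\} \;\le\; \exp\Big\{ -\frac{n\, k_1 k_2 \min\{k_1^2,k_2^2\}\, D^4}{2^{10}\alpha^4 d_1^2 d_2^2} \Big\} . \tag{95}$$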


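Let $\eta = \exp\Big( -\frac{n\, k_1 k_2 \min\{k_1^2,k_2^2\}\,(\beta - 1.002)\,\tilde\mu^4}{2^{10}\alpha^4 d_1^2 d_2^2} \Big)$. Applying the tail bound to (93), we get

\begin{align*}
\mathbb{P}\Big\{ \exists \Delta \in \mathcal{A}' ,\; H(\Delta) < \frac{e^{-2\alpha}}{8\,d_1 d_2}\,|||\Delta|||_{\mathrm{F}}^2 \Big\}
&\le \sum_{\ell=1}^{\infty} \exp\Big\{ -\frac{n\, k_1 k_2 \min\{k_1^2,k_2^2\}\,(\beta^{\ell}\tilde\mu)^4}{2^{10}\alpha^4 d_1^2 d_2^2} \Big\} \\
&\overset{(a)}{\le} \sum_{\ell=1}^{\infty} \exp\Big\{ -\frac{\ell\; n\, k_1 k_2 \min\{k_1^2,k_2^2\}\,(\beta - 1.002)\,\tilde\mu^4}{2^{10}\alpha^4 d_1^2 d_2^2} \Big\} \\
&\le \frac{\eta}{1-\eta} ,
\end{align*}

where (a) holds because $\beta^{x} \ge x\log\beta \ge x(\beta - 1.002)$ for the choice of $\beta = \sqrt{10/9}$. By the definition of $\tilde\mu$,

$$\eta = \exp\Big\{ -\frac{2^{30}\, k_1 k_2 \max\{d_1^2,d_2^2\}\,(\log d)^2\,(\beta - 1.002)}{n} \Big\} \;\le\; \exp\{ -2^{25}\log d \} ,$$

where the last inequality follows from the assumption that $n \le k_1 k_2 \max\{d_1^2,d_2^2\}\log d$, and $\beta - 1.002 \ge 2^{-5}$. Since for $d \ge 2$, $\exp\{-2^{25}\log d\} \le 1/2$ and thus $\eta \le 1/2$, the lemma follows by assembling the last two displayed inequalities.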
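The numerical constants used in the peeling step can be checked directly. The minimal sketch below verifies, for $\beta = \sqrt{10/9}$, the chain $\beta^x \ge x\log\beta \ge x(\beta - 1.002)$ on a few sample values of $x$, the bound $\beta - 1.002 \ge 2^{-5}$, and the fact that a geometric series with ratio $\eta \le 1/2$ sums to at most $\eta/(1-\eta) \le 2\eta$. The helper names are hypothetical, and the sample values of $x$ and $\eta$ are arbitrary.

```python
import math

BETA = math.sqrt(10 / 9)  # peeling ratio used in the proof
GAP = BETA - 1.002        # constant appearing in the definition of eta


def step_a_holds(x):
    """Check beta**x >= x*log(beta) >= x*(beta - 1.002) for a given x > 0."""
    return BETA ** x >= x * math.log(BETA) >= x * GAP


def geometric_sum(eta, terms=500):
    """Partial sum eta + eta**2 + ... + eta**terms."""
    return sum(eta ** ell for ell in range(1, terms + 1))


print(all(step_a_holds(x) for x in (1, 2, 4, 8, 100)))
print(GAP >= 2 ** -5)
print(geometric_sum(0.5) <= 2 * 0.5 + 1e-12)
```

The margin in $\log\beta \ge \beta - 1.002$ is small (roughly $0.0527$ against $0.0521$), which is why the offset $1.002$ rather than $1$ appears in the argument.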


D.3 Proof of Lemma D.3

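Let $Z = \sup_{\Delta\in\mathcal{B}'(D)} \mathbb{E}[H(\Delta)] - H(\Delta)$ and consider the tail bound using McDiarmid's inequality. Note that $Z$ has a bounded difference of $(8\alpha^2 e^{-2\alpha}\max\{k_1,k_2\})/(k_1 k_2 n)$ when one of the $k_1 k_2 n$ independent random variables is changed, which gives

$$\mathbb{P}\{ Z - \mathbb{E}[Z] \ge t \} \;\le\; \exp\Big( -\frac{k_1^2 k_2^2 n^2 t^2}{64\,\alpha^4 e^{-4\alpha}\max\{k_1^2,k_2^2\}\, k_1 k_2\, n} \Big) . \tag{96}$$

With the choice of $t = e^{-2\alpha} D^2/(4 d_1 d_2)$, this gives

$$\mathbb{P}\Big\{ Z - \mathbb{E}[Z] \ge \frac{e^{-2\alpha} D^2}{4\,d_1 d_2} \Big\} \;\le\; \exp\Big( -\frac{k_1 k_2\, n\, D^4}{2^{10}\alpha^4 d_1^2 d_2^2 \max\{k_1^2,k_2^2\}} \Big) . \tag{97}$$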
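The passage from the generic tail bound (96) to the explicit exponent in (97) is pure arithmetic, and can be checked numerically. The minimal sketch below assumes the bounded difference $c = 8\alpha^2 e^{-2\alpha}\max\{k_1,k_2\}/(k_1 k_2 n)$ over $k_1 k_2 n$ variables and the choice $t = e^{-2\alpha}D^2/(4 d_1 d_2)$, and compares $t^2/\sum_i c_i^2$ with the closed form $k_1 k_2 n D^4 / (2^{10}\alpha^4 d_1^2 d_2^2 \max\{k_1^2,k_2^2\})$. The function names and the numerical inputs are hypothetical.

```python
import math


def mcdiarmid_exponent(t, n, k1, k2, alpha):
    """t^2 divided by the sum of squared bounded differences, as in (96)."""
    c = 8 * alpha ** 2 * math.exp(-2 * alpha) * max(k1, k2) / (k1 * k2 * n)
    num_vars = k1 * k2 * n
    return t ** 2 / (num_vars * c ** 2)


def closed_form_exponent(D, n, d1, d2, k1, k2, alpha):
    """Exponent of (97) after substituting t = exp(-2*alpha) * D^2 / (4*d1*d2)."""
    return (k1 * k2 * n * D ** 4) / (
        2 ** 10 * alpha ** 4 * d1 ** 2 * d2 ** 2 * max(k1, k2) ** 2
    )


# Hypothetical values, used only to exercise the identity.
n, d1, d2, k1, k2, alpha, D = 1000, 30, 40, 3, 5, 0.7, 12.0
t = math.exp(-2 * alpha) * D ** 2 / (4 * d1 * d2)
lhs = mcdiarmid_exponent(t, n, k1, k2, alpha)
rhs = closed_form_exponent(D, n, d1, d2, k1, k2, alpha)
print(lhs, rhs)
```

The factor $e^{-4\alpha}$ cancels between $t^2$ and $c^2$, and $64 \cdot 16 = 2^{10}$ accounts for the constant.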


We first construct a partition of the space similar to Lemma A.6. Let

$$k = \min\{k_1, k_2\} . \tag{98}$$


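Lemma D.4. There exists a partition $(\mathcal{T}_1, \ldots, \mathcal{T}_N)$ of $\{[k_1]\times[k_2]\}\times\{[k_1]\times[k_2]\}$ for some $N \le 2 k_1^2 k_2^2/k$ such that the $\mathcal{T}_\ell$'s are disjoint subsets, $\bigcup_{\ell\in[N]} \mathcal{T}_\ell = \{[k_1]\times[k_2]\}\times\{[k_1]\times[k_2]\}$, $|\mathcal{T}_\ell| \le k$ and for any $\ell \in [N]$ the set of random variables in $\mathcal{T}_\ell$ satisfy

$$\big\{ \big( \Delta_{j_{i,a},\,j_{i,b}} - \Delta_{j_{i,a'},\,j_{i,b'}} \big)^2 \big\}_{i\in[n],\,(a,b,a',b')\in\mathcal{T}_\ell} \;\text{ are mutually independent} ,$$

where $j_{i,a}$ for $i \in [n]$ and $a \in [k_1]$ denotes the $a$-th chosen item to be included in the set $S_i$.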


Now we prove an upper bound on $\mathbb{E}[Z]$ using the symmetrization technique. Recall that $j_{i,a}$ is independently and uniformly chosen from $[d_1]$ for $i \in [n]$ and $a \in [k_1]$. Similarly, $j_{i,b}$ is independently and uniformly chosen from $[d_2]$ for $i \in [n]$ and $b \in [k_2]$.


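\begin{align}
\mathbb{E}[Z] &= \frac{e^{-2\alpha}}{2 k_1^2 k_2^2\, n}\, \mathbb{E}\Big[ \sup_{\Delta\in\mathcal{B}'(D)} \sum_{i=1}^n \sum_{a,a'\in[k_1]} \sum_{b,b'\in[k_2]} \Big( \mathbb{E}\big[ (\Delta_{j_{i,a},j_{i,b}} - \Delta_{j_{i,a'},j_{i,b'}})^2 \big] - (\Delta_{j_{i,a},j_{i,b}} - \Delta_{j_{i,a'},j_{i,b'}})^2 \Big) \Big] \nonumber\\
&\le \frac{e^{-2\alpha}}{2 k_1^2 k_2^2\, n} \sum_{\ell\in[N]} \mathbb{E}\Big[ \sup_{\Delta\in\mathcal{B}'(D)} \sum_{i=1}^n \sum_{(j_1,j_2,j_1',j_2')\in\mathcal{T}_\ell} \Big( \mathbb{E}\big[ (\Delta_{j_1,j_2} - \Delta_{j_1',j_2'})^2 \big] - (\Delta_{j_1,j_2} - \Delta_{j_1',j_2'})^2 \Big) \Big] \tag{100}\\
&\le \frac{e^{-2\alpha}}{k_1^2 k_2^2\, n} \sum_{\ell\in[N]} \mathbb{E}\Big[ \sup_{\Delta\in\mathcal{B}'(D)} \sum_{i=1}^n \sum_{(j_1,j_2,j_1',j_2')\in\mathcal{T}_\ell} \xi_{i,j_1,j_2,j_1',j_2'}\, (\Delta_{j_1,j_2} - \Delta_{j_1',j_2'})^2 \Big] , \tag{101}
\end{align}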

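where the first inequality follows from the fact that the supremum of the sum is smaller than the sum of the suprema, and the second inequality follows from standard symmetrization with i.i.d. Rademacher random variables $\xi_{i,j_1,j_2,j_1',j_2'}$. It follows from the Ledoux-Talagrand contraction inequality that

\begin{align}
\mathbb{E}\Big[ \sup_{\Delta\in\mathcal{B}'(D)} \sum_{i=1}^n \sum_{(j_1,j_2,j_1',j_2')\in\mathcal{T}_\ell} \xi_{i,j_1,j_2,j_1',j_2'}\, (\Delta_{j_1,j_2} - \Delta_{j_1',j_2'})^2 \Big]
&\le 8\alpha\, \mathbb{E}\Big[ \sup_{\Delta\in\mathcal{B}'(D)} \sum_{i=1}^n \sum_{(j_1,j_2,j_1',j_2')\in\mathcal{T}_\ell} \xi_{i,j_1,j_2,j_1',j_2'}\, (\Delta_{j_1,j_2} - \Delta_{j_1',j_2'}) \Big] \tag{102}\\
&= 8\alpha\, \mathbb{E}\Big[ \sup_{\Delta\in\mathcal{B}'(D)} \Big\langle\Big\langle \Delta ,\, \sum_{i=1}^n \sum_{(j_1,j_2,j_1',j_2')\in\mathcal{T}_\ell} \xi_{i,j_1,j_2,j_1',j_2'}\, (e_{j_1,j_2} - e_{j_1',j_2'}) \Big\rangle\Big\rangle \Big] \tag{103}\\
&\le 8\alpha\, \mathbb{E}\Big[ \sup_{\Delta\in\mathcal{B}'(D)} |||\Delta|||_{\mathrm{nuc}}\, \Big|\Big|\Big| \sum_{i=1}^n \sum_{(j_1,j_2,j_1',j_2')\in\mathcal{T}_\ell} \xi_{i,j_1,j_2,j_1',j_2'}\, (e_{j_1,j_2} - e_{j_1',j_2'}) \Big|\Big|\Big|_2 \Big] \tag{104}\\
&\le \frac{8\alpha D^2}{\tilde\mu}\, \mathbb{E}\Big[ \Big|\Big|\Big| \sum_{i=1}^n \sum_{(j_1,j_2,j_1',j_2')\in\mathcal{T}_\ell} \xi_{i,j_1,j_2,j_1',j_2'}\, (e_{j_1,j_2} - e_{j_1',j_2'}) \Big|\Big|\Big|_2 \Big] , \tag{105}
\end{align}

where $e_{j_1,j_2}$ denotes $e_{j_1} e_{j_2}^T$, the second inequality follows from Hölder's inequality, and the last inequality follows from $\tilde\mu\,|||\Delta|||_{\mathrm{nuc}} \le D^2$ for all $\Delta \in \mathcal{B}'(D)$. To bound the expected spectral norm of the random matrix, we use the matrix Bernstein inequality. Note that $|||\xi_{i,j_1,j_2,j_1',j_2'}\,(e_{j_1,j_2} - e_{j_1',j_2'})|||_2 \le \sqrt{2}$ almost surely, $\mathbb{E}[(e_{j_1,j_2} - e_{j_1',j_2'})(e_{j_1,j_2} - e_{j_1',j_2'})^T] \preceq (2/d_1)\, I_{d_1\times d_1}$, and $\mathbb{E}[(e_{j_1,j_2} - e_{j_1',j_2'})^T (e_{j_1,j_2} - e_{j_1',j_2'})] \preceq (2/d_2)\, I_{d_2\times d_2}$.

It follows that $\sigma^2 = 2n|\mathcal{T}_\ell|/\min\{d_1,d_2\}$, where $|\mathcal{T}_\ell| \le \min\{k_1,k_2\}$. It follows that

$$\mathbb{P}\Big\{ \Big|\Big|\Big| \sum_{i=1}^n \sum_{(j_1,j_2,j_1',j_2')\in\mathcal{T}_\ell} \xi_{i,j_1,j_2,j_1',j_2'}\, (e_{j_1,j_2} - e_{j_1',j_2'}) \Big|\Big|\Big|_2 > t \Big\} \;\le\; (d_1+d_2)\exp\Big\{ \frac{-t^2/2}{\frac{2n\min\{k_1,k_2\}}{\min\{d_1,d_2\}} + \frac{\sqrt{2}\,t}{3}} \Big\} . \tag{106}$$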


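Choosing $t = \max\{\sqrt{64\,n\,(\min\{k_1,k_2\}/\min\{d_1,d_2\})\log d},\; (16\sqrt{2}/3)\log d\}$, we obtain a bound of $t$ on the spectral norm with probability at least $1 - 2d^{-7}$. From the fact that $|||\sum_{i=1}^n \sum_{(j_1,j_2,j_1',j_2')\in\mathcal{T}_\ell} \xi_{i,j_1,j_2,j_1',j_2'}\,(e_{j_1,j_2} - e_{j_1',j_2'})|||_2 \le (n\sqrt{2})\min\{k_1,k_2\}$, it follows that

\begin{align}
\mathbb{E}\Big[ \Big|\Big|\Big| \sum_{i=1}^n \sum_{(j_1,j_2,j_1',j_2')\in\mathcal{T}_\ell} \xi_{i,j_1,j_2,j_1',j_2'}\, (e_{j_1,j_2} - e_{j_1',j_2'}) \Big|\Big|\Big|_2 \Big]
&\le \max\Big\{ \sqrt{\frac{64\,n\min\{k_1,k_2\}\log d}{\min\{d_1,d_2\}}} \,,\; \frac{16\sqrt{2}}{3}\log d \Big\} + \frac{2\sqrt{2}\,n\min\{k_1,k_2\}}{d^{7}} \tag{107}\\
&\le \sqrt{\frac{66\,n\min\{k_1,k_2\}\log d}{\min\{d_1,d_2\}}} , \tag{108}
\end{align}

which follows from the assumption that $n\min\{k_1,k_2\} \ge \min\{d_1,d_2\}\log d$ and $n \le d^5\log d$. Substituting this bound in (101) and (105), we get that

\begin{align}
\mathbb{E}[Z] &\le \frac{16\,e^{-2\alpha}\alpha D^2}{\tilde\mu} \sqrt{\frac{66\log d}{n\min\{k_1,k_2\}\min\{d_1,d_2\}}} \tag{109}\\
&\le \frac{e^{-2\alpha} D^2}{4\,d_1 d_2} . \tag{110}
\end{align}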

E Proof of the information-theoretic lower bound in Theorem 4


This proof closely follows the proof of Theorem in Appendix . We apply the generalized Fano's inequality in the same way to get


P 



(?)~‘ PKL(e<fa)||e(fe)) + iog2 

logM 


( 111 ) 


The main challenge in this case is that we can no longer directly apply the RUM interpretation to compute $D_{\mathrm{KL}}(\Theta^{(\ell_1)} \,\|\, \Theta^{(\ell_2)})$. Doing so would overestimate the KL-divergence, because that approach does not take into account that we only observe the top winner out of the $k_1 k_2$ alternatives. Instead, we compute the divergence directly, and provide an appropriate bound. Let the sets of $k_1$ rows and $k_2$ columns chosen in one of the $n$ samplings be $S \subseteq [d_1]$ and $T \subseteq [d_2]$ respectively. Then,
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ier 


log 


6“*^ Ei'ese 
feT 






j'er 


/ 
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/rfl\ (d2\ /- 

\ki)\k 2 J s; 
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„ 2 e. 


(«i) 
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/ e * 


(^ 2 ) 
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^2a 


TIC V V "T 

u2u2(di\ (d2\ 
'^l'^2\ki) \k2) S,T i,j 
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^ij 
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e ®v — e / e 'v 


ne 


2q 
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*=?*!{£)(£) t 


E»®‘"’E 


0('^l) 0('^2) 

g o — e V 


(<?2) 




e®»j 


E' 




('?!) 


5q: 


ne 




0('^l) 0('^2) 

g O' — e 


di\/d2\ 


ne' 


5a 


did2 


0(^l) _ 0(^2) 


V 
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( 112 ) 


(113) 


fi©) 


g y 


(114) 

\ 

x)if) 


/ 

(116) 

(117) 

(118) 
(119) 


Here (a) is by the definition of the KL-divergence and the fact that $S$, $T$ are chosen uniformly from all possible such sets, and (b) is due to the fact that $\log(x) \le x - 1$ with $x = \big( e^{\Theta^{(\ell_1)}_{ij}} \sum_{i'\in S,\, j'\in T} e^{\Theta^{(\ell_2)}_{i'j'}} \big) / \big( e^{\Theta^{(\ell_2)}_{ij}} \sum_{i'\in S,\, j'\in T} e^{\Theta^{(\ell_1)}_{i'j'}} \big)$. The constants at (c) are due to the fact that each element of $\Theta^{(\ell_1)}$ is upper bounded by $\alpha$ and lower bounded by $-\alpha$. We can get (d) by removing the second term, which is always negative, and using the bound of $\alpha$. (e) is obtained because $e^{x}$ with $-\alpha \le x \le \alpha$ is Lipschitz continuous with Lipschitz constant $e^{\alpha}$. At last, (f) is obtained by simple counting of the occurrences of each $ij$. Thus we have


© 

e 


ni) 

i'j' 






> 1 - 


t2S[M] did2 


p)(fe) _ 0(&) 
ij ij 


+ log 2 


log M 

The remainder of the proof relies on the following probabilistic packing. 


( 120 ) 


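Lemma E.1. Let $d_2 \ge d_1$ be sufficiently large positive integers. Then for each $r \in \{1, \ldots, d_1\}$, and for any positive $\delta > 0$ there exists a family of $d_1 \times d_2$ dimensional matrices $\{\Theta^{(1)}, \ldots, \Theta^{(M(\delta))}\}$ with cardinality $M(\delta) = \lfloor (1/4)\exp(r d_2/576) \rfloor$ such that each matrix is rank $r$ and the following bounds hold:

\begin{align}
|||\Theta^{(\ell)}|||_{\mathrm{F}} &\le \delta , \quad \text{for all } \ell \in [M] \tag{121}\\
|||\Theta^{(\ell_1)} - \Theta^{(\ell_2)}|||_{\mathrm{F}} &\ge \tfrac{1}{2}\delta , \quad \text{for all } \ell_1, \ell_2 \in [M] \tag{122}\\
\Theta^{(\ell)} &\in \Omega_{\tilde\alpha} , \quad \text{for all } \ell \in [M] , \tag{123}
\end{align}

with $\tilde\alpha = (8\delta/d_2)\sqrt{2\log d}$ for $d = (d_1 + d_2)/2$.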

Suppose $\delta \le \alpha d_2/(8\sqrt{2\log d})$ such that the matrices in the packing set are entry-wise bounded by $\alpha$, then the above Lemma E.1 implies that $|||\Theta^{(\ell_1)} - \Theta^{(\ell_2)}|||_{\mathrm{F}}^2 \le 4\delta^2$, which gives


( 121 ) 

( 122 ) 

(123) 


\l^l} 


^ , ^+I°g2 ^ 1 

2 ’ 


did2 _ 

gi-21og2 


(124) 

_ _ V l/^rr V /. 

576 

where the last inequality holds for < (r(iid|/(1152e®“n)) and assuming rd 2 > 1600. Together 
with (124) and (122), this proves that for all 6 < min{ad 2 /( 8 y^ 2logd),rdid^/(1152e®“n)}, 


inf sup E 
e e*eOc 


0 - 0 =* 


F J 


> -5/4 


Choosing $\delta$ appropriately to maximize the right-hand side finishes the proof of the desired claim. Also by symmetry, we can apply the same argument to get a similar bound with $d_1$ and $d_2$ interchanged.


E.1 Proof of Lemma E.1

We show that the following procedure succeeds in producing the desired family with probability at least half, which proves its existence. Let $d = (d_1 + d_2)/2$, and suppose $d_2 \ge d_1$ without loss of generality. For the choice of $M' = e^{r d_2/576}$, and for each $\ell \in [M']$, generate a rank-$r$ matrix $\Theta^{(\ell)} \in \mathbb{R}^{d_1\times d_2}$ as follows:




(125) 


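where $U \in \mathbb{R}^{d_1\times r}$ is a random orthogonal basis such that $U^T U = I_{r\times r}$ and $V^{(\ell)} \in \mathbb{R}^{d_2\times r}$ is a random matrix with each entry $V^{(\ell)}_{ij} \in \{-1,+1\}$ chosen independently and uniformly at random. By construction, notice that $|||\Theta^{(\ell)}|||_{\mathrm{F}} \le (\delta/\sqrt{r d_2})\,|||U (V^{(\ell)})^T|||_{\mathrm{F}} = \delta$.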
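The norm identity $|||U (V^{(\ell)})^T|||_{\mathrm{F}} = \sqrt{r d_2}$ can be illustrated with a small simulation. The following is a minimal sketch in pure Python, assuming the scaled product form $(\delta/\sqrt{r d_2})\, U (V^{(\ell)})^T$ suggested by the identity above; the mean-subtraction step in `centered` is an additional assumption of this sketch, included only to illustrate how a zero-sum constraint can turn the equality into an inequality. All function names and the sizes used are hypothetical.

```python
import math
import random


def random_orthonormal_columns(d1, r, rng):
    """Modified Gram-Schmidt on Gaussian vectors: r orthonormal vectors in R^d1."""
    basis = []
    while len(basis) < r:
        v = [rng.gauss(0.0, 1.0) for _ in range(d1)]
        for u in basis:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip numerically degenerate draws
            basis.append([a / norm for a in v])
    return basis  # basis[c][i] is the entry U[i][c]


def packing_matrix(d1, d2, r, delta, rng):
    """Assumed form (delta / sqrt(r*d2)) * U * V^T with V entries in {-1, +1}."""
    cols = random_orthonormal_columns(d1, r, rng)
    signs = [[rng.choice((-1, 1)) for _ in range(r)] for _ in range(d2)]
    scale = delta / math.sqrt(r * d2)
    return [
        [scale * sum(cols[c][i] * signs[j][c] for c in range(r)) for j in range(d2)]
        for i in range(d1)
    ]


def frobenius(m):
    return math.sqrt(sum(x * x for row in m for x in row))


def centered(m):
    """Subtract the overall entry mean (an assumption of this sketch)."""
    mean = sum(x for row in m for x in row) / (len(m) * len(m[0]))
    return [[x - mean for x in row] for row in m]


rng = random.Random(0)
theta = packing_matrix(d1=6, d2=9, r=2, delta=1.5, rng=rng)
print(frobenius(theta), frobenius(centered(theta)))
```

Because the columns of $U$ are orthonormal, each column of $U (V^{(\ell)})^T$ has squared norm $r$, so the uncentered matrix has Frobenius norm exactly $\delta$ up to floating-point error, while subtracting the mean can only decrease it.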

Now, by the triangle inequality, we have


©(^ 1 ) _ ©(^ 2 ) 

> ;L 


F \/rd2 


> 


Vrd2 


[/(y(L) _ y(A))T 
y(A) _ y(A) 


—^^^—-—^11 U 

F did2\nfh2 ^ 


'^rdidl 


B 


We will prove that the first term is bounded by $A \ge \sqrt{r d_2}$ with probability at least 7/8 for all $M'$ matrices, and we will show that we can find $M$ matrices such that the second term is bounded by $B \le 8\sqrt{2 r d_2 \log(32r)\log(32d)}$ with probability at least 7/8. Together, this proves that with probability at least 3/4, there exist $M$ matrices such that

$$|||\Theta^{(\ell_1)} - \Theta^{(\ell_2)}|||_{\mathrm{F}} \;\ge\; \delta\Big( 1 - \sqrt{\frac{2^{7}\log(32r)\log(32d)}{d_1 d_2}} \Big) \;\ge\; \frac{1}{2}\delta ,$$

for all $\ell_1, \ell_2 \in [M]$ and for sufficiently large $d_1$ and $d_2$.

Applying McDiarmid's inequality as in Eq. (69), it follows that $A^2 \ge r d_2$ with probability at least 7/8 for $M' = e^{r d_2/576}$ and sufficiently large $d_2$.

To prove a bound on $B$, we will show that for a given $\ell$,

$$\mathbb{P}\Big\{ B \le 8\sqrt{2 r d_2 \log(32r)\log(32d)} \Big\} \;\ge\; \frac{7}{8} . \tag{126}$$


Then using a similar technique as in (72), it follows that we can find $M = (1/4)M'$ matrices all satisfying this bound and also the bound on the max-entry in (127). We are left to prove (126). We apply a series of concentration inequalities. Let $H_1$ be the event that $\{|\langle V^{(\ell)}_i, \mathbb{1} \rangle| \le \sqrt{2 d_2 \log(32r)}$ for all $i \in [r]\}$. Then, applying the standard Hoeffding's inequality, we get that $\mathbb{P}\{H_1\} \ge 15/16$, where $V^{(\ell)}_i$ is the $i$-th column of $V^{(\ell)}$. We next change the variables and represent $\mathbb{1}^T U$ as $\sqrt{d_1}\, u^T \tilde{U}$, where $u$ is drawn uniformly at random from the unit sphere and $\tilde{U}$ is an $r$-dimensional subspace drawn uniformly at random. By symmetry, $\sqrt{d_1}\, u^T \tilde{U}$ has the same distribution as $\mathbb{1}^T U$. Let $H_2$ be the event that $\{|\langle \tilde{U}_i, (V^{(\ell)})^T \mathbb{1} \rangle| \le \sqrt{16\, r\, (d_2/d_1) \log(32r)\log(32d)}$ for all $i \in [d_1]\}$,

where Ui is the i-th row of U. Then, applying Levy’s theorem for concentration on the sphere 
p9] . we have F{H 2 \Hi} > 15/16. Finally, let be the event that {\\/di{{u,U{V^^'))'^))l\ < 
8 i/ 2 rd 2 log(32r) log(32d)}. Then, again applying Levy’s concentration, we get F {Hs\Hi, H 2 } > 
15/16. Collecting all three concentration inequalities, we get that with probability at least 13/16, 
< 8 ^ 2 rd 2 log(32r) log(32d), which proves Eq. ( 126 ). 

We are left to prove that gj-Q in f^( 85 /d 2 )C 2 Togd 7 defined in ( [ItI ). Similar to Eq. ( [tI] ), 

applying Levy’s concentration gives 


max |0 


(^)i 


< 


25-^732 log d 2 1 


ij I — 


d2 


I 


> 1 — 2 exp I ~ 2 log ^ 2 1 > 


— ri 5 


(127) 


for a fixed $\ell \in [M']$. Then using a similar technique as in (72), it follows that there exist $M = (1/4)M'$ matrices all satisfying this bound and also the bound on $B$ in Eq. (126).